Sidenotes on Lempel-Ziv Quine

under Blog

I read Russ Cox(Cox 2010)'s solution to the self-reproducing zip file problem today. I have to admit that, due to my lack of intelligence I guess, I don't understand the proof of the solution very well. Thus, I'll post here some clarification on the solution I have to myself, for those like me who get confused about the solution.

I'm going to jump directly into the final Lempel-Ziv program, and I won't talk about the actual encoding of the zip file, which makes the problem harder.

Problem rephrasing

Suppose we have the following language, where the minimal unit is one code:

literal, Lx: following by an arbitrary array of codes of length x, prints the array into the output as-is. That is, L3 aa bb cc results as aa bb cc in the output.
repeat, Rx: prints the last x codes in the output into the output again. That is, if the output is aa bb cc dd at that moment, R3 turns it into aa bb cc dd bb cc dd.
No-op, L0: Notice that as a special case for repeat, L0, is no-op.
(no term): Any other "illegal" codes, i.e. not in the literal array following Lx and is not Lx or Rx, are considered comments. That is, they will not be printed to the output. And they can only appears at the beginning or at the end.

Then, given a p-code long prefix [P], and an s-code long suffix [S], can we find a quine prefixing with [P] and suffixing with [S]? that is, if we note the process of decoding as f() then can we find a string of codes [C] such that (++ means concatenation here)

f([P] ++ [C] ++ [S]) = [P] ++ [C] ++ [S]

Actions

We say that we have two array of codes, Code and Output, and, assume that eventually we have Code == Output, we say that the shorter one at a moment is "left behind" for the sub-string [X].

OutputCatchUp([X]): for an x-code-long string [X], this action can be invoked when the Output is left behind for [X], and as a result the Output is left behind for Rx+1. The actual encoding for this action is Lx+1 [X] Lx+1 Rx+1, and the output is [X] Lx+1 [X] Lx+1.
Swap(): This action can be invoked when the Output is left behind for R4, and as a result the Code is left behind for L4, like a swap. This is a magic fixed string L4 R4 L4 R4 L4 R4, and the output is R4 L4 R4 L4 R4 L4 R4 L4.
CodeWait([X]): for an x-code-long-string [X], this action can be invoked when the Code is left behind for Lx+1, and as a result the Code is left behind for [X]. The actual encoding for this action is Lx+1 Rx+1 [X] Rx+1, and the output is Rx+1 [X] Rx+1 [X]. Note that Swap() is a special case of this, where [X] = L4 R4 L4.
OutputHold(X): for a single code X, this action can be invoked when Output is left behind for X, and as a result Output is left behind (further) for X L1 L1. The actual encoding for this action is L1 X L1 L1, and the output is X L1.
CodeLift(n): For a length n, this action can be invoked when Code is left behind for L4, and as a result the Code is left behind for Ln. The actual encoding for this action is L4 R4 L0 L0 Ln R4 L0 L0, and the output is R4 L0 L0 Ln R4 L0 L0 Ln.

Properties

When Code is left behind, the only thing it can do is to "catch up" by filling the diff sub-string compared with Output until it is no longer left behind, i.e. it is the same as Output (which means we find the solution) or the Output is left behind. In other words, the set of actions can be performed at each moment Code left behind is fixed. The crucial core to the solution is to build this set into the one we want.
OutputCatchUp([X]) basically should not be called when [X]'s length x is 1, resulting in an hard-to-break loop.

Solution demystified

To begin with, filling the prefix comment [P], Output is left behind for [P]
Apply OutputCatchUp([P]), now the Output is left behind for Rp+1
Apply OutputHold(Rp+1), now the Output is left behind for Rp+1 L1 L1
Apply OutputCatchUp(Rp+1 L1 L1), now the Output is left behind for R4
Apply Swap(), now the Code is left behind for L4
Apply CodeLift(s+1), now the Code is left behind for Ls+1
Apply CodeWait([S]), now the Code is left behind for [S]
Fill the suffix comment [S]. We are done!

Notice that

Before Swap() all actions require Output to be left behind, and after that all actions require Code to be left behind.
The core idea before Swap() is to build Swap()'s input requirement, i.e lacking R4 on Output somehow, and get rid of the interference of the initial prefix comment as soon as possible.
The core idea after Swap() is to build the moment requiring Ls+1 on Code. Only then can [S] be printed.

Reference

Cox, Russ. 2010. “Zip Files All the Way down.” March 18, 2010. https://research.swtch.com/zip.