OverTheWireAdvent ChristmaSSE KeyGen
- Tags: rev math
- Points: 271
- Solves: 45
I ran this program but it never finished… maybe my computer is too slow. Maybe yours is faster?
Author: hpmv
Solution
Summary
- Reverse engineer to understand the algorithm
- the program uses a lot of SSE instructions, and we had to manually reverse it
- the algorithm found to be matrix multiplication with modulo
- Rewrite the program with an efficient algorithm
- Power of matrix: efficiently compute when exponent is large
Details
Reverse engineer to understand the algorithm
When we try to run the program, the program does not terminate. This is because the program is computing something in an inefficient way. We cannot speed it up by simply optimising it, but we have to understand the meaning of the program and rewrite it with more efficient algorithm.
The program has a lot of SSE instructions, as the name of the challenge implies. First we tried to decompile it with Ghidra or other reverse engineering tools, but it doesn’t work well.
The program structure is shown below. There are outer loop and inner loop.
0000000000400570 <main>:
400570: 66 0f 6f 05 08 0b 20 movdqa xmm0,XMMWORD PTR [rip+0x200b08] # 601080 <data+0x40>
400577: 00
400578: 66 0f 70 d8 54 pshufd xmm3,xmm0,0x54
...
40058e: 48 b9 15 81 e9 7d f4 movabs rcx,0x112210f47de98115
400595: 10 22 11
### outer loop
.outer_loop:
4005a0: 0f ae f8 sfence
4005a3: 66 44 0f 6f 0d 94 0a movdqa xmm9,XMMWORD PTR [rip+0x200a94]
4005aa: 20 00
4005ac: 66 41 0f 70 f1 00 pshufd xmm6,xmm9,0x0 # 0b00000000
4005b2: 66 0f 38 40 f3 pmulld xmm6,xmm3
...
.inner_loop:
4006d0: 66 0f 6f f4 movdqa xmm6,xmm4
4006d4: 66 0f 66 f3 pcmpgtd xmm6,xmm3
...
400724: 83 c2 ff add edx,0xffffffff
400727: 75 a7 jne .inner_loop
400729: 48 83 c0 01 add rax,0x1
40072d: 48 39 c8 cmp rax,rcx
400730: 0f 85 6a fe ff ff jne .outer_loop
400736: 48 83 ec 18 sub rsp,0x18
...
We ran the program with debugger to see internal state (xmm0, …, xmm3) for each round, in order to find out the recurrence formula. The main operation seems to be done in outer loop, and inner loop doesn’t seem to do anything, at least for the first several iterations. The output from outer loop looks like:
1st iteration:
xmm3 = {0x1, 0x2, 0x3, 0x4},
xmm2 = {0x5, 0x6, 0x7, 0x8},
xmm1 = {0x9, 0xa, 0xb, 0xc},
xmm0 = {0xd, 0xe, 0xf, 0x10},
2nd iteration:
xmm3 = {0x5a, 0x64, 0x6e, 0x78},
xmm2 = {0xca, 0xe4, 0xfe, 0x118},
xmm1 = {0x13a, 0x164, 0x18e, 0x1b8},
xmm0 = {0x1aa, 0x1e4, 0x21e, 0x258},
3rd iteration:
xmm3 = {0xc44, 0xde8, 0xf8c, 0x1130},
xmm2 = {0x1c64, 0x2028, 0x23ec, 0x27b0},
xmm1 = {0x2c84, 0x3268, 0x384c, 0x3e30},
xmm0 = {0x3ca4, 0x44a8, 0x4cac, 0x54b0},
4th iteration:
xmm3 = {0x1bd28, 0x1f810, 0x232f8, 0x26de0},
xmm2 = {0x40468, 0x48c90, 0x514b8, 0x59ce0},
xmm1 = {0x64ba8, 0x72110, 0x7f678, 0x8cbe0},
xmm0 = {0x892e8, 0x9b590, 0xad838, 0xbfae0},
...
Here we gave up finding the rule, and began to analyse the code.
movdqa
: move data into xmm registerpshufd
: select 32bit elements
pshufd
is a bit complicated instruction, but we can understand the instruction this way:
If we have pshufd xmm1, xmm2, 0x36
, then convert 0x36
to binary form 0b00110110
, then separate it for each 2 bits (00
11
01
10
), and use it as source register index as follows:
xmm1[0] := xmm2[0b00 = 0]
xmm1[1] := xmm2[0b11 = 3]
xmm1[2] := xmm2[0b01 = 1]
xmm1[3] := xmm2[0b10 = 2]
Then we annotate the meanings of each instruction as follows:
0000000000400570 <main>:
400570: 66 0f 6f 05 08 0b 20 movdqa xmm0,XMMWORD PTR [rip+0x200b08] # 601080 <data+0x40>
400577: 00
xmm0 = {0x1, 0x0, 0x96433d, 0x0}
400578: 66 0f 70 d8 54 pshufd xmm3,xmm0,0x54 # 0x01010100
xmm3 = {xmm0[0], xmm0[1], xmm0[1], xmm0[1]} == {0x1, 0x0, 0x0, 0x0}
40057d: 66 0f 70 d0 51 pshufd xmm2,xmm0,0x51 # 0x01010001
xmm2 = {xmm0[1], xmm0[0], xmm0[1], xmm0[1]} == {0x0, 0x1, 0x0, 0x0}
400582: 66 0f 70 c8 45 pshufd xmm1,xmm0,0x45 # 0x01000101
xmm1 = {xmm0[1], xmm0[1], xmm0[0], xmm0[1]} == {0x0, 0x0, 0x1, 0x0}
400587: 66 0f 70 c0 15 pshufd xmm0,xmm0,0x15 # 0x00010101
xmm0 = {xmm0[1], xmm0[1], xmm0[1], xmm0[0]} == {0x0, 0x0, 0x0, 0x1}
40058c: 31 c0 xor eax,eax
eax = 0
40058e: 48 b9 15 81 e9 7d f4 movabs rcx,0x112210f47de98115
400595: 10 22 11
rcx = 1234567890123456789
400598: 0f 1f 84 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
40059f: 00
### outer loop effectively computes "matrix multiplication"
L1:
4005a0: 0f ae f8 sfence
4005a3: 66 44 0f 6f 0d 94 0a movdqa xmm9,XMMWORD PTR [rip+0x200a94] # 601040 <data>
4005aa: 20 00
xmm9 = {0x1, 0x2, 0x3, 0x4}
4005ac: 66 41 0f 70 f1 00 pshufd xmm6,xmm9,0x0 # 0b00000000
xmm6[i] = 1 # xmm9[0]
4005b2: 66 0f 38 40 f3 pmulld xmm6,xmm3
xmm6[i] = 1 * xmm3[i]
4005b7: 66 45 0f 70 e1 55 pshufd xmm12,xmm9,0x55 # 0b01010101
xmm12[i] = 2 # xmm9[1]
4005bd: 66 44 0f 38 40 e2 pmulld xmm12,xmm2
xmm12[i] = 2 * xmm2[i]
4005c3: 66 44 0f fe e6 paddd xmm12,xmm6
xmm12[i] = 2 * xmm2[i] + 1 * xmm3[i]
4005c8: 66 44 0f 6f 15 7f 0a movdqa xmm10,XMMWORD PTR [rip+0x200a7f] # 601050 <data+0x10>
4005cf: 20 00
xmm10 = {0x5, 0x6, 0x7, 0x8}
4005d1: 66 45 0f 70 c2 00 pshufd xmm8,xmm10,0x0 # 0b00000000
xmm8[i] = 5 # xmm10[0]
4005d7: 66 44 0f 38 40 c3 pmulld xmm8,xmm3
xmm8[i] = 5 * xmm3[i] for i in range(0, 4)
4005dd: 66 44 0f 6f 2d 7a 0a movdqa xmm13,XMMWORD PTR [rip+0x200a7a] # 601060 <data+0x20>
4005e4: 20 00
xmm13 = {0x9, 0xa, 0xb, 0xc}
4005e6: 66 45 0f 70 dd 00 pshufd xmm11,xmm13,0x0 # 0b00000000
xmm11 = 9 # xmm13[0]
4005ec: 66 44 0f 38 40 db pmulld xmm11,xmm3
xmm11[i] = 9 * xmm3[i]
4005f2: 66 44 0f 6f 3d 75 0a movdqa xmm15,XMMWORD PTR [rip+0x200a75] # 601070 <data+0x30>
4005f9: 20 00
xmm15 = {0xd, 0xe, 0xf, 0x10}
4005fb: 66 45 0f 70 f7 00 pshufd xmm14,xmm15,0x0 # 0b00000000
xmm14 = 13 # xmm15[0]
400601: 66 44 0f 38 40 f3 pmulld xmm14,xmm3
xmm14[i] = 13 * xmm3[i]
400607: 66 41 0f 70 e9 aa pshufd xmm5,xmm9,0xaa # 0b10101010
xmm5 = 3 # xmm9[2]
40060d: 66 0f 38 40 e9 pmulld xmm5,xmm1
xmm5[i] = 3 * xmm1[i]
400612: 66 41 0f 70 d9 ff pshufd xmm3,xmm9,0xff # 0b11111111
xmm3 = 4 # xmm9[3]
400618: 66 0f 38 40 d8 pmulld xmm3,xmm0
xmm3[i] = 4 * xmm0[i]
40061d: 66 0f fe dd paddd xmm3,xmm5
xmm3[i] = 4 * xmm0[i] + 3 * xmm1[i]
400621: 66 41 0f fe dc paddd xmm3,xmm12
xmm3[i] = 4 * xmm0[i] + 3 * xmm1[i] + 2 * xmm2[i] + 1 * xmm3[i]
400626: 66 41 0f 70 ea 55 pshufd xmm5,xmm10,0x55 # 0b01010101
xmm5 = {xmm10[1], xmm10[i], xmm10[i], xmm10[i]} == {0x6, 0x6, 0x6, 0x6}
40062c: 66 0f 38 40 ea pmulld xmm5,xmm2
xmm5[i] = 6 * xmm2[i]
400631: 66 41 0f fe e8 paddd xmm5,xmm8
xmm5[i] = 6 * xmm2[i] + 5 * xmm3[i]
400636: 66 41 0f 70 fd 55 pshufd xmm7,xmm13,0x55 # 0b01010101
xmm7[i] = 10 # xmm13[1]
40063c: 66 0f 38 40 fa pmulld xmm7,xmm2
xmm7[i] = 10 * xmm2[i]
400641: 66 41 0f 70 e7 55 pshufd xmm4,xmm15,0x55 # 0b01010101
xmm4[i] = 14 # xmm15[i]
400647: 66 0f 38 40 e2 pmulld xmm4,xmm2
xmm4[i] = 14 * xmm2[i]
40064c: 66 41 0f 70 f2 aa pshufd xmm6,xmm10,0xaa # 0b10101010
xmm6[i] = 7 # xmm10[2]
400652: 66 0f 38 40 f1 pmulld xmm6,xmm1
xmm6[i] = 7 * xmm1[i]
400657: 66 41 0f 70 d2 ff pshufd xmm2,xmm10,0xff # 0b11111111
xmm2[i] = 8 # xmm10[3]
40065d: 66 0f 38 40 d0 pmulld xmm2,xmm0
xmm2[i] = 8 * xmm0[i]
400662: 66 0f fe d6 paddd xmm2,xmm6
xmm2[i] = 8 * xmm0[i] + 7 * xmm1[i]
400666: 66 0f fe d5 paddd xmm2,xmm5
xmm2[i] = 8 * xmm0[i] + 7 * xmm1[i] + 6 * xmm2[i] + 5 * xmm3[i]
40066a: 66 41 0f fe fb paddd xmm7,xmm11
xmm7[i] = 10 * xmm2[i] + 9 * xmm3[i]
40066f: 66 41 0f 70 ed aa pshufd xmm5,xmm13,0xaa # 0b10101010
xmm5[i] = 11 # xmm13[2]
400675: 66 0f 38 40 e9 pmulld xmm5,xmm1
xmm5[i] = 11 * xmm1[i]
40067a: 66 41 0f 70 f7 aa pshufd xmm6,xmm15,0xaa # 0b10101010
xmm6[i] = 15 # xmm15[2]
400680: 66 0f 38 40 f1 pmulld xmm6,xmm1
xmm6[i] = 15 * xmm1[i]
400685: 66 41 0f 70 cd ff pshufd xmm1,xmm13,0xff # 0b11111111
xmm1[i] = 12 # xmm13[3]
40068b: 66 0f 38 40 c8 pmulld xmm1,xmm0
xmm1[i] = 12 * xmm0[i]
400690: 66 0f fe cd paddd xmm1,xmm5
xmm1[i] = 12 * xmm0[i] + 11 * xmm1[i]
400694: 66 0f fe cf paddd xmm1,xmm7
xmm1[i] = 12 * xmm0[i] + 11 * xmm1[i] + 10 * xmm2[i] + 9 * xmm3[i]
400698: 66 41 0f fe e6 paddd xmm4,xmm14
xmm4[i] = 14 * xmm2[i] + 13 * xmm3[i]
40069d: 66 41 0f 70 ff ff pshufd xmm7,xmm15,0xff # 0b11111111
xmm7[i] = 16 # xmm15[i]
4006a3: 66 0f 38 40 f8 pmulld xmm7,xmm0
xmm7[i] = 16 * xmm0[i]
4006a8: 66 0f fe fe paddd xmm7,xmm6
xmm7[i] = 16 * xmm0[i] + 15 * xmm1[i]
4006ac: 66 0f fe fc paddd xmm7,xmm4
xmm7[i] = 16 * xmm0[i] + 15 * xmm1[i] + 14 * xmm2[i] + 13 * xmm3[i]
4006b0: 66 0f 6f 05 c8 09 20 movdqa xmm0,XMMWORD PTR [rip+0x2009c8] # 601080 <data+0x40>
4006b7: 00
xmm0 = {0x1, 0x0, 0x96433d, 0x0}
4006b8: 66 0f 70 e0 aa pshufd xmm4,xmm0,0xaa # 0b10101010
xmm4[i] = 0x96433d # xmm0[2]
4006bd: 66 0f 70 e8 00 pshufd xmm5,xmm0,0x0 # 0b00000000
xmm5[i] = 0x1 # xmm0[0]
4006c2: ba e8 03 00 00 mov edx,0x3e8
edx = 1000
4006c7: 66 0f 6f c7 movdqa xmm0,xmm7
xmm7[i] = 16 * xmm0[i] + 15 * xmm1[i] + 14 * xmm2[i] + 13 * xmm3[i]
4006cb: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
### inner loop effectively computes "mod 0x96433d"
L2:
4006d0: 66 0f 6f f4 movdqa xmm6,xmm4
xmm6[i] = 0x96433d
4006d4: 66 0f 66 f3 pcmpgtd xmm6,xmm3
xmm6[i] = 0x96433d > xmm3[i] ? -1 : 0
4006d8: 66 0f fe f5 paddd xmm6,xmm5
xmm6[i] = 0x96433d > xmm3[i] ? 0 : 1
4006dc: 66 0f 38 40 f4 pmulld xmm6,xmm4
xmm6[i] = 0x96433d > xmm3[i] ? 0 : 0x96433d
4006e1: 66 0f fa de psubd xmm3,xmm6
xmm3[i] = 0x96433d > xmm3[i] ? xmm3[i] : xmm3[i] - 0x96433d
4006e5: 66 0f 6f f4 movdqa xmm6,xmm4
4006e9: 66 0f 66 f2 pcmpgtd xmm6,xmm2
4006ed: 66 0f fe f5 paddd xmm6,xmm5
4006f1: 66 0f 38 40 f4 pmulld xmm6,xmm4
4006f6: 66 0f fa d6 psubd xmm2,xmm6
xmm2[i] = 0x96433d > xmm2[i] ? xmm2[i] : xmm2[i] - 0x96433d
4006fa: 66 0f 6f f4 movdqa xmm6,xmm4
4006fe: 66 0f 66 f1 pcmpgtd xmm6,xmm1
400702: 66 0f fe f5 paddd xmm6,xmm5
400706: 66 0f 38 40 f4 pmulld xmm6,xmm4
40070b: 66 0f fa ce psubd xmm1,xmm6
xmm1[i] = 0x96433d > xmm1[i] ? xmm1[i] : xmm1[i] - 0x96433d
40070f: 66 0f 6f f4 movdqa xmm6,xmm4
400713: 66 0f 66 f0 pcmpgtd xmm6,xmm0
400717: 66 0f fe f5 paddd xmm6,xmm5
40071b: 66 0f 38 40 f4 pmulld xmm6,xmm4
400720: 66 0f fa c6 psubd xmm0,xmm6
xmm0[i] = 0x96433d > xmm0[i] ? xmm0[i] : xmm0[i] - 0x96433d
400724: 83 c2 ff add edx,0xffffffff
400727: 75 a7 jne 4006d0 <main+0x160>
if (--edx != 0) goto L2;
400729: 48 83 c0 01 add rax,0x1
40072d: 48 39 c8 cmp rax,rcx
400730: 0f 85 6a fe ff ff jne 4005a0 <main+0x30>
if (++rax != rcx) goto L1;
400736: 48 83 ec 18 sub rsp,0x18
40073a: 66 0f 7f 1c 24 movdqa XMMWORD PTR [rsp],xmm3
40073f: 66 0f 7f 54 24 10 movdqa XMMWORD PTR [rsp+0x10],xmm2
400745: 66 0f 7f 4c 24 20 movdqa XMMWORD PTR [rsp+0x20],xmm1
40074b: 66 0f 7f 44 24 30 movdqa XMMWORD PTR [rsp+0x30],xmm0
400751: 31 c0 xor eax,eax
eax = 0
400753: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
40075a: 00 00 00
40075d: 0f 1f 00 nop DWORD PTR [rax]
400760: 0f b6 0c 44 movzx ecx,BYTE PTR [rsp+rax*2]
400764: 30 88 90 10 60 00 xor BYTE PTR [rax+0x601090],cl
40076a: 0f b6 4c 44 01 movzx ecx,BYTE PTR [rsp+rax*2+0x1]
40076f: 30 88 91 10 60 00 xor BYTE PTR [rax+0x601091],cl
400775: 48 83 c0 02 add rax,0x2
400779: 48 83 f8 20 cmp rax,0x20
40077d: 75 e1 jne 400760 <main+0x1f0>
# xmm3, xmm2, xmm1, xmm0 is used to xor with the following bytes:
# fc14eb09 bcaee747 4fe37cc1 52a5028e
# 8971c88d 9623016d 71405aea fd461d23
40077f: be 54 08 40 00 mov esi,0x400854
400784: bf 01 00 00 00 mov edi,0x1
400789: ba 05 00 00 00 mov edx,0x5
40078e: 31 c0 xor eax,eax
400790: e8 cb fc ff ff call 400460 <write@plt>
400795: be 90 10 60 00 mov esi,0x601090
40079a: bf 01 00 00 00 mov edi,0x1
40079f: ba 20 00 00 00 mov edx,0x20
4007a4: 31 c0 xor eax,eax
4007a6: e8 b5 fc ff ff call 400460 <write@plt>
4007ab: be 5a 08 40 00 mov esi,0x40085a
4007b0: bf 01 00 00 00 mov edi,0x1
4007b5: ba 02 00 00 00 mov edx,0x2
4007ba: 31 c0 xor eax,eax
4007bc: e8 9f fc ff ff call 400460 <write@plt>
4007c1: 31 ff xor edi,edi
4007c3: e8 a8 fc ff ff call 400470 <exit@plt>
4007c8: 0f 1f 84 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
4007cf: 00
Then we found out that the registers are initialised to:
- xmm0 = {0x0, 0x0, 0x0, 0x1}
- xmm1 = {0x0, 0x0, 0x1, 0x0}
- xmm2 = {0x0, 0x1, 0x0, 0x0}
- xmm3 = {0x1, 0x0, 0x0, 0x0}
The outer loop computes the following for 1234567890123456789 times:
- xmm0[i] = 16 * xmm0[i] + 15 * xmm1[i] + 14 * xmm2[i] + 13 * xmm3[i]
- xmm1[i] = 12 * xmm0[i] + 11 * xmm1[i] + 10 * xmm2[i] + 9 * xmm3[i]
- xmm2[i] = 8 * xmm0[i] + 7 * xmm1[i] + 6 * xmm2[i] + 5 * xmm3[i]
- xmm3[i] = 4 * xmm0[i] + 3 * xmm1[i] + 2 * xmm2[i] + 1 * xmm3[i]
- inner loop
This effectively is a multiplication of matrix:
The inner loop computes the following for 1000 times:
- xmm0[i] = 0x96433d > xmm0[i] ? xmm0[i] : xmm0[i] - 0x96433d
- xmm1[i] = 0x96433d > xmm1[i] ? xmm1[i] : xmm1[i] - 0x96433d
- xmm2[i] = 0x96433d > xmm2[i] ? xmm2[i] : xmm2[i] - 0x96433d
- xmm3[i] = 0x96433d > xmm3[i] ? xmm3[i] : xmm3[i] - 0x96433d
Since , it effectively computes “” for each element of xmm0
..xmm3
.
To summarise, one iteration of outer loop will compute:
And the final block will compute flag using the computed xmm0..xmm3 result, by concatenating xmm3, xmm2, xmm1, xmm0 in that order and exclusive-or’ing the value with the constant bytes fc14eb09 bcaee747 4fe37cc1 52a5028e 8971c88d 9623016d 71405aea fd461d23
Rewrite the program with an efficient algorithm
Now we understand the program as matrix multiplication and modulo. However, the outer loop is iterated for incredibly many times, it cannot be directly computed. We simplify the computation by using power of matrix.
Since the initial value of the xmm registers are a unit matrix, as shown below:
The final value would be:
We optimised the computation of power of matrix by using the following formula, to compute the power of matrix in time, where is the exponent:
chal10-solve.py:
import numpy
import struct
import sys
print(sys.getrecursionlimit())
# create 4x4 array
matrix = numpy.array(
[[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16]])
def powmod(mat, n, m):
if n == 1:
return numpy.mod(mat, m)
else:
x = powmod(mat, n//2, m)
x = numpy.mod(numpy.matmul(x, x), m)
if n % 2 == 1:
return numpy.mod(numpy.matmul(x, mat), m)
else:
return x
def powmod_slow(mat, n, m):
x = mat
for i in range(1, n):
x = numpy.mod(numpy.matmul(x, mat), m)
return x
#for i in range(1, 20):
# x = powmod(matrix, i, 0x96433d)
# for j in range(4):
# print(' {0x%x, 0x%x, 0x%x, 0x%x}, ' % tuple(x[j]))
result = powmod(matrix, 0x112210f47de98115, 0x96433d)
result = result.astype('<u4')
print(result)
key = list(result.tobytes())
# print(key)
encrypted = list(bytes.fromhex('fc14eb09bcaee7474fe37cc152a5028e8971c88d9623016d71405aeafd461d23'))
# print(encrypted)
for i in range(0, 0x20, 2):
encrypted[i] ^= key[i*2]
encrypted[i+1] ^= key[i*2+1]
# print(encrypted)
print(bytes(encrypted))
$ python3 chal10-solve.py
1000
[[7938225 4160415 382605 6452408]
[8295223 6860620 5426017 3991414]
[8652221 9560825 621816 1530420]
[9009219 2413417 5665228 8917039]]
b'M4tr1x_3xp0n3nti4t1on_5728391723'
by Fukutomo