Loading a 16-bit (or bigger) immediate with Arm inline GCC assembly

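Note: For brevity the examples here are simplified, so they do not justify my intent. If I were only writing to a memory location as in the example, C would be the best approach. However, I am doing things where I cannot use C, so please do not downvote just because this particular example would be better kept in C.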

I am trying to load a register with a value, but I am stuck with 8-bit immediates.

My code:

#include <cstdint>

void a(uint32_t value) {
    *(volatile uint32_t *)(0x21014) = value;
}

void b(uint32_t value) {
    asm (
        "push ip                                \n\t"
        "mov ip,%[gpio_out_addr_high]    \n\t"
        "lsl ip,ip,#8 \n\t"
        "add ip,%[gpio_out_addr_low]     \n\t"
        "lsl ip,#2 \n\t"
        "str %[value],[ip]                     \n\t"
        "pop ip                                 \n\t"
        : 
        : [gpio_out_addr_low]  "I"((0x21014 >> 2)     & 0xff),
          [gpio_out_addr_high] "I"((0x21014 >> (2+8)) & 0xff),
          [value] "r"(value)
    );
}

// adding -march=ARMv7E-M will not allow 16-bit immediate
// void c(uint32_t value) {
//     asm (
//         "mov ip,%[gpio_out_addr]    \n\t"
//         "str %[value],[ip]                     \n\t"
//         : 
//         : [gpio_out_addr]  "I"(0x1014),
//           [value] "r"(value)
//     );
// } 


int main() {
    a(20);
    b(20);
    return 0;
}
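
The operand arithmetic used by b() can be checked on a host machine. This is only a sketch: the names kAddress, kLow, kHigh, rebuild and splitRoundTrips are made up for illustration and are not part of the original code. It shows that the two 8-bit pieces reproduce 0x21014 only because the address has its two low bits clear and fits in 18 bits.

```cpp
#include <cstdint>

// Hypothetical names, used only for this illustration.
constexpr uint32_t kAddress = 0x21014;
constexpr uint32_t kLow  = (kAddress >> 2) & 0xff;        // same expression as gpio_out_addr_low
constexpr uint32_t kHigh = (kAddress >> (2 + 8)) & 0xff;  // same expression as gpio_out_addr_high

// Mirrors the instruction sequence in b(): mov, lsl #8, add, lsl #2.
constexpr uint32_t rebuild(uint32_t high, uint32_t low) {
    return ((high << 8) + low) << 2;
}

// True when an address survives the split into two 8-bit pieces unchanged.
constexpr bool splitRoundTrips(uint32_t address) {
    return rebuild((address >> (2 + 8)) & 0xff, (address >> 2) & 0xff) == address;
}

static_assert(rebuild(kHigh, kLow) == kAddress, "0x21014 is rebuilt exactly");
static_assert(!splitRoundTrips(0x40021014u), "a full 32-bit address does not fit this scheme");
```

The second static_assert is the limitation of this trick: anything above 18 bits, or with non-zero low bits, is silently truncated.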

When I write the C code (see a()), Godbolt compiles it to:

a(unsigned char):
        mov     r3,#135168
        str     r0,[r3,#20]
        bx      lr

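I think it is using MOV as a pseudo-instruction. When I want to do the same in assembly, I could put the value in some memory location and load it with LDR. I think that is how the C code gets compiled when I use -march=ARMv7E-M (the MOV is replaced with LDR), but in many cases this is not practical for me, because I will be doing other things.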
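
As background for the MOV in that output: GCC's machine-constraints documentation describes "I" in ARM state as an integer in the range 0 to 255 rotated by a multiple of 2, which is the form a data-processing instruction can encode. The sketch below is a host-side check of that rule only (the helper names are made up for illustration, and the Thumb-2 immediate encoding follows different rules that are not modelled here). 135168 is 0x21000, so it fits the rule, while 0x21014 and 0x1014 do not.

```cpp
#include <cstdint>

// Rotate a 32-bit value left by n bits (n in 0..31).
constexpr uint32_t rotateLeft(uint32_t v, unsigned n) {
    return n == 0 ? v : ((v << n) | (v >> (32u - n)));
}

// Hypothetical helper: true if v is an 8-bit value rotated right by an even
// amount, i.e. the ARM-state rule quoted for the "I" constraint.
constexpr bool isArmDataImmediate(uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        // Rotating left by rot undoes a rotate-right by rot.
        if (rotateLeft(v, rot) <= 0xffu) {
            return true;
        }
    }
    return false;
}

static_assert(isArmDataImmediate(135168u), "0x21000 is 0x21 shifted into place");
static_assert(!isArmDataImmediate(0x21014u), "set bits span 16 bits, too wide");
static_assert(!isArmDataImmediate(0x1014u), "set bits span 11 bits, too wide");
```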

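For the address 0x21014 the low 2 bits are zero, so with the right shift I can treat this 18-bit number as a 16-bit one, which is what I do in b(), but I still have to pass it as 8-bit immediates. However, in the Keil documentation I noticed a mention of 16-bit immediates: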

https://www.keil.com/support/man/docs/armasm/armasm_dom1359731146992.htm

https://www.keil.com/support/man/docs/armasm/armasm_dom1361289878994.htm

In ARMv6T2 and later, both the ARM and Thumb instruction sets include:

A MOV instruction that can load any value in the range 0x00000000 to 0x0000FFFF into a register.
A MOVT instruction that can load any value in the range 0x0000 to 0xFFFF into the most significant half of a register, without altering the contents of the least significant half.

I believe my Cortex-M4 is ARMv7E-M, so it should satisfy this "ARMv6T2 and later" requirement and should be able to use 16-bit immediates.
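
The two quoted instruction descriptions can be modelled on a host as plain arithmetic. This is a sketch of the documented behaviour only, with function names chosen for illustration; it is not how the instructions are emitted.

```cpp
#include <cstdint>

// Models MOV with a 16-bit immediate: the register becomes the
// zero-extended immediate (range 0x00000000 to 0x0000FFFF).
constexpr uint32_t movw(uint16_t imm16) {
    return imm16;
}

// Models MOVT: writes the most significant half and leaves the
// least significant half as it was.
constexpr uint32_t movt(uint32_t reg, uint16_t imm16) {
    return (reg & 0x0000ffffu) | (static_cast<uint32_t>(imm16) << 16);
}

// The address from the question split into its two halves.
static_assert(movw(0x1014) == 0x00001014u, "low half only");
static_assert(movt(movw(0x1014), 0x0002) == 0x00021014u, "both halves give 0x21014");
```

With this pair any 32-bit constant can be built in two instructions, provided the target supports them.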

But I do not see any such mention in the GCC inline assembly documentation:

https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html

When I enable the ARMv7E-M architecture and uncomment c(), where I use the regular "I" constraint for the immediate, I get a compile error:

<source>: In function 'void c(uint8_t)':
<source>:29:6: warning: asm operand 0 probably doesn't match constraints
   29 |     );
      |      ^
<source>:29:6: error: impossible constraint in 'asm'

So I wonder: is there a way to use 16-bit immediates in GCC inline assembly, or am I missing something (which would make my question irrelevant)?

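Side question: is it possible to disable these pseudo-instructions in Godbolt? I have seen them used with RISC-V assembly as well, but I would prefer to see the disassembled real bytecode, so I can tell exactly which instructions these pseudo/macro assembly instructions produce.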

Solution

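@Jester recommended in the comments either using the i constraint to pass larger immediates, or using a real C variable, initializing it with the desired value and letting the inline assembly take it. This sounds like the best solution: the less time spent in inline assembly the better. People who want better performance often underestimate how powerful the C/C++ toolchain is at optimizing when given correct code, and for many of them rewriting the C/C++ code is the answer rather than redoing everything in assembly. @Peter Cordes suggested not using inline assembly at all, and I concur. However, in this case the exact timing of some instructions was critical, and I could not risk the toolchain optimizing the timing of those instructions slightly differently.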
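
A minimal sketch of the first suggestion, assuming a GCC-compatible compiler and a target with MOVW/MOVT. The helper loadImmediate is hypothetical and not from the comments; the preprocessor guard and the host fallback are only there so the snippet also builds on a non-Arm machine. The "i" constraint accepts any compile-time integer constant, whereas "I" only accepts values a data-processing instruction can encode.

```cpp
#include <cstdint>

// Hypothetical helper: returns VALUE, materialised in a register by
// MOVW/MOVT when built for an Armv7 (or later) target.
template <uint32_t VALUE>
inline uint32_t loadImmediate() {
    uint32_t reg;
#if defined(__arm__) && (__ARM_ARCH >= 7)
    asm(
        "movw %[reg], %[low]   \n\t"  // reg = low 16 bits, upper half cleared
        "movt %[reg], %[high]  \n\t"  // upper half = high 16 bits, lower half kept
        : [reg] "=r"(reg)
        : [low] "i"(VALUE & 0xffffu), [high] "i"(VALUE >> 16));
#else
    reg = VALUE;  // host fallback, no assembly involved
#endif
    return reg;
}
```

For example, loadImmediate<0x21014>() would yield the address from the question. The second suggestion needs no special handling: a plain C variable holding the constant is passed with an "r" constraint and the compiler decides how to load it.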

Bit-banging protocols is not ideal, and in most cases the answer is to avoid bit-banging. In my case it is not that simple, and other approaches did not work:

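SPI could not be used to shift the data, because I needed to push more signals and with arbitrary lengths, while my hardware supports only 8-bit/16-bit transfers.
Tried DMA2GPIO, but ran into jitter problems.
Tried an IRQ handler, but its overhead was too high and my performance dropped (as seen below, there are only 2 nops, so there is not much room to do anything in the idle time).
Tried pre-baking a stream of bits (including the timing), but for 1 byte of real data I ended up storing 64 bytes of stream data, and reading it all from memory was much slower overall.
Pre-baked functions for each write value (with a lookup table of functions, one per written value) worked very well, in fact too fast: the toolchain now had compile-time known values and optimized them so well that my TCK was above 40MHz. The problem was that I had to add a lot of delays to bring it down to the desired speed (8MHz), and this had to be done for each input value. For lengths of 8 bits or less that is fine, but for 32-bit lengths it cannot fit into flash (2^32 => 4294967296), and splitting a single 32-bit access into four 8-bit accesses introduced a lot of jitter on the TCK signal.
Implementing this peripheral in FPGA fabric would give me control over everything, and usually that is the correct answer, but I wanted to try implementing it on a device without fabric.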

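Long story short, bit-banging is bad and there are mostly better ways around it, and using inline assembly needlessly may actually produce worse results without you knowing it, but in my case I needed it. In my previous code I tried to focus on a simple question about immediates rather than going off on tangents or into an X-Y problem discussion.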

Now back to the topic of "passing bigger immediates to the assembly". Here is an implementation of a more real-world example:

https://godbolt.org/z/5vbb7PPP5

#include <cstdint>

const uint8_t TCK = 2;
const uint8_t TMS = 3;
const uint8_t TDI = 4;
const uint8_t TDO = 5;

template<uint8_t number>
constexpr uint8_t powerOfTwo() {
    static_assert(number <8,"Output would overflow,the JTAG pins are close to base of the register and you shouldn't need PIN8 or above anyway");
    int ret = 1;
    for (int i=0; i<number; i++) {
        ret *= 2;
    }
    return ret;
}

template<uint8_t WHAT_SIGNAL>
__attribute__((optimize("-Ofast")))
uint32_t shiftAsm(const uint32_t length,uint32_t write_value) {
    uint32_t addressWrite = 0x40021014; // ODR register of GPIO port E (normally not hardcoded,but just for godbolt example it's like this)
    uint32_t addressRead  = 0x40021010; // IDR register of GPIO port E (normally not hardcoded,but just for godbolt example it's like this)

    uint32_t count     = 0;
    uint32_t shift_out = 0;
    uint32_t shift_in  = 0;
    uint32_t ret_value = 0;

    asm volatile (
    "cpsid if                                                  \n\t"  // Disable IRQ
    "repeatForEachBit%=:                                       \n\t"

    // Low part of the TCK
    "and.w %[shift_out],%[write_value],#1               \n\t"  // shift_out = write_value & 1
    "lsls  %[shift_out],%[shift_out],%[write_shift]   \n\t"  // shift_out = shift_out << pin_shift
    "str   %[shift_out],[%[gpio_out_addr]]                  \n\t"  // GPIO = shift_out

    // On the first cycle this is redundant, as it processes the shift_in from the previous iteration.
    // The first iteration is safe to do extraneously, as it is just handling zeros
    "lsr   %[shift_in],%[shift_in],%[read_shift]    \n\t"  // shift_in = shift_in >> TDO
    "and.w %[shift_in],#1               \n\t"  // shift_in = shift_in & 1
    "lsl   %[ret_value],#1                                  \n\t"  // ret = ret << 1
    "orr.w %[ret_value],%[ret_value],%[shift_in]      \n\t"  // ret = ret | shift_in

    // Prepare things that are needed toward the end of the loop,but can be done now
    "orr.w %[shift_out],%[clock_mask]    \n\t"  // shift_out = shift_out | (1 << TCK)
    "lsr   %[write_value],#1               \n\t"  // write_value = write_value >> 1
    "adds  %[count],#1                                  \n\t"  // count++
    "cmp   %[count],%[length]                           \n\t"  // if (count != length) then ....

    // High part of the TCK + sample
    "str   %[shift_out],[%[gpio_out_addr]]                  \n\t"  // GPIO = shift_out
    "nop                                                       \n\t"
    "nop                                                       \n\t"
    "ldr   %[shift_in],[%[gpio_in_addr]]                   \n\t"  // shift_in = GPIO
    "bne.n repeatForEachBit%=                                  \n\t"  // if (count != length) then  repeatForEachBit

    "cpsie if                                                  \n\t"  // Enable IRQ - the critical part finished

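    // Process the last shift_in here, as normally it is done in the next iteration of the loop
    "lsr   %[shift_in],%[shift_in],%[read_shift]    \n\t"  // shift_in = shift_in >> TDO
    "and.w %[shift_in],#1               \n\t"  // shift_in = shift_in & 1
    "lsl   %[ret_value],#1                                  \n\t"  // ret = ret << 1
    "orr.w %[ret_value],%[ret_value],%[shift_in]      \n\t"  // ret = ret | shift_in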

    // Outputs
    : [ret_value]       "+r"(ret_value),
      [count]           "+r"(count),
      [shift_out]       "+r"(shift_out),
      [shift_in]        "+r"(shift_in),
      [write_value]     "+r"(write_value)   // modified by the loop, so it must be read-write

    // Inputs
    : [gpio_out_addr]   "r"(addressWrite),
      [gpio_in_addr]    "r"(addressRead),
      [length]          "r"(length),
      [write_shift]     "M"(WHAT_SIGNAL),
      [read_shift]      "M"(TDO),
      [clock_mask]      "I"(powerOfTwo<TCK>())

    // Clobbers ("cc" because lsls/adds/cmp change the condition flags)
    : "memory", "cc"
    );

    return ret_value;
}

int main() {
    shiftAsm<TMS>(7,0xff);                  // reset the target TAP controller
    shiftAsm<TMS>(3,0x12);                  // go to some arbitrary TAP state
    shiftAsm<TDI>(32,0xdeadbeef);            // write to target

    auto ret = shiftAsm<TDI>(16,0x0000);     // read from the target

    return 0;
}
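
To make the data flow of the assembly loop easier to follow, here is a host-side model of the same sequence of steps in plain C++. It is a sketch under stated assumptions: LoopbackPort and shiftModel are hypothetical names, the port simply loops the driven data pin back onto the sampled pin instead of talking to real GPIO registers, and nothing about timing is modelled. Like the assembly, it assumes length is at least 1.

```cpp
#include <cstdint>

// Hypothetical stand-in for the GPIO port: whatever is driven on the data pin
// is looped back onto the TDO pin, so the model can run without hardware.
struct LoopbackPort {
    uint8_t data_pin;  // pin driven by the shifter (TMS or TDI)
    uint8_t tdo_pin;   // pin sampled by the shifter
    uint32_t odr;      // last value written to the output register

    void write(uint32_t value) { odr = value; }
    uint32_t read() const { return ((odr >> data_pin) & 1u) << tdo_pin; }
};

// Same order of operations as the assembly loop, without any timing guarantees.
inline uint32_t shiftModel(LoopbackPort& port, uint8_t tck_pin,
                           uint32_t length, uint32_t write_value) {
    uint32_t count = 0;
    uint32_t shift_in = 0;
    uint32_t ret_value = 0;
    do {
        // Low part of TCK: present the next data bit, LSB first.
        uint32_t shift_out = (write_value & 1u) << port.data_pin;
        port.write(shift_out);
        // Fold in the sample taken during the previous iteration.
        ret_value = (ret_value << 1) | ((shift_in >> port.tdo_pin) & 1u);
        write_value >>= 1;
        ++count;
        // High part of TCK, then sample the input.
        port.write(shift_out | (1u << tck_pin));
        shift_in = port.read();
    } while (count != length);
    // The last sample has no following iteration, so it is folded in here.
    ret_value = (ret_value << 1) | ((shift_in >> port.tdo_pin) & 1u);
    return ret_value;
}
```

With the loopback assumption, data goes out least significant bit first and the first sampled bit ends up in the most significant position of the result, so shifting 4 bits of 0xB returns 0xD.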

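@David Wohlferd's comment about reducing the amount of assembly gives the toolchain more opportunity to further optimize the "loading of the addresses into registers": when the function is inlined, the addresses should not be loaded again (so it is done only once even though there are multiple read/write invocations). Here it is with inlining enabled: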

https://godbolt.org/z/K8GYYqrbq

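The question: is it worth it? I think so. My TCK is dead-on 8MHz and my duty cycle is close to 50%, and I have greater confidence that the duty cycle will stay as it is. The sampling is done when I expect it to be done, without worrying that it will be optimized differently under different toolchain settings.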
