Accurate floating-point computation of the sum and difference of two products

How to accurately compute the floating-point sum and difference of two products

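The difference of two products and the sum of two products are two primitives found in a variety of common computations. diff_of_products (a, b, c, d) := ab - cd and sum_of_products (a, b, c, d) := ab + cd are closely related companion functions that differ only in the sign of some of their operands. Examples of the use of these primitives are: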

Computation of a complex multiplication with x = (a + i b) and y = (c + i d):

x * y = diff_of_products (a, c, b, d) + i sum_of_products (a, d, b, c)

Computation of the determinant of a 2x2 matrix: diff_of_products (a, d, b, c):

| a  b |
| c  d |

In a right-angle triangle, computation of the length of the opposite cathetus from the hypotenuse h and the adjacent cathetus a: sqrt (diff_of_products (h, h, a, a))

Computation of the two real solutions of a quadratic equation with positive discriminant:

q = -(b + copysign (sqrt (diff_of_products (b, b, 4a, c)), b)) / 2
x₀ = q / a
x₁ = c / q

Computation of a 3D cross product a = b ⨯ c:

a_x = diff_of_products (b_y, c_z, b_z, c_y)
a_y = diff_of_products (b_z, c_x, b_x, c_z)
a_z = diff_of_products (b_x, c_y, b_y, c_x)

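When computing with IEEE-754 binary floating-point formats, besides the obvious issues with potential overflow and underflow, naive implementations of either function can suffer from catastrophic cancellation when the two products are similar in magnitude but of opposite signs for sum_of_products() or of the same sign for diff_of_products().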

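Focusing only on the accuracy aspect, how can these functions be implemented robustly in the context of IEEE-754 binary arithmetic? The availability of fused multiply-add operations can be assumed, since this operation is supported by most modern processor architectures and is exposed in many programming languages via standard functions. Without loss of generality, the discussion can be restricted to the single-precision (IEEE-754 binary32) format for ease of exposition and testing.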

Solution

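The utility of the fused multiply-add (FMA) operation in providing protection against subtractive cancellation stems from the participation of the full double-width product in the final addition. To my knowledge, the first public record of its utility for the accurate and robust computation of the solutions of quadratic equations is two sets of informal notes by renowned floating-point expert William Kahan: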

William Kahan, "Matlab's Loss is Nobody's Gain". August 1998, revised July 2004 (online)
William Kahan, "On the Cost of Floating-Point Computation Without Extra-Precise Arithmetic". November 2004 (online)

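The standard work on numerical computing by Higham is where I first encountered Kahan's algorithm applied to the computation of the determinant of a 2x2 matrix (p. 65):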

Nicholas J. Higham, "Accuracy and Stability of Numerical Algorithms", SIAM, 1996

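A different algorithm for the computation of ab + cd, also based on FMA, was published by three Intel researchers in the context of Intel's first CPU with FMA support, the Itanium processor (p. 273):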

Marius Cornea, John Harrison, and Ping Tak Peter Tang, "Scientific Computing on Itanium-based Systems", Intel Press, 2002

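In recent years, four papers by French researchers examined both algorithms in detail and provided error bounds with mathematical proofs. For binary floating-point arithmetic, provided there is no overflow or underflow in intermediate computations, the maximum relative error of both Kahan's algorithm and the Cornea-Harrison-Tang (CHT) algorithm was shown to be asymptotically twice the unit round-off, that is, 2u. For IEEE-754 binary32 (single precision) this error bound is 2^-23, and for IEEE-754 binary64 (double precision) it is 2^-52.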
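The two quoted figures follow from the usual convention, assumed here, that the unit round-off of a binary format with p significand bits (counting the implicit leading bit) is u = 2^-p. The symbols p, r (exact result) and r̂ (computed result) are notation introduced for this illustration only.

```latex
\[
  \frac{|\hat{r} - r|}{|r|} \lesssim 2u, \qquad u = 2^{-p}
\]
\[
  \text{binary32: } p = 24,\quad u = 2^{-24},\quad 2u = 2^{-23}
  \qquad\qquad
  \text{binary64: } p = 53,\quad u = 2^{-53},\quad 2u = 2^{-52}
\]
```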

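Furthermore, it was shown that the error in Kahan's algorithm is at most 1.5 ulps for binary floating-point arithmetic. From the literature I am not aware of an equivalent result, that is, a proven ulp error bound, for the CHT algorithm. My own experiments using the code below suggest an error bound of 1.25 ulp.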

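Sylvie Boldo, "Kahan's Algorithm for a Correct Discriminant Computation at Last Formally Proven", IEEE Transactions on Computers, Vol. 58, No. 2, February 2009, pp. 220-225 (online)

Claude-Pierre Jeannerod, Nicolas Louvet, and Jean-Michel Muller, "Further Analysis of Kahan's Algorithm for the Accurate Computation of 2x2 Determinants", Mathematics of Computation, Vol. 82, No. 284, October 2013, pp. 2245-2264 (online)

Jean-Michel Muller, "On the Error of Computing ab+cd using Cornea, Harrison and Tang's Method", ACM Transactions on Mathematical Software, Vol. 41, No. 2, Article 7, January 2015 (online)

Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software, Vol. 42, No. 3, Article 19, May 2016 (online)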

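Kahan's algorithm requires four floating-point operations, two of which are FMAs, while the CHT algorithm requires seven floating-point operations, two of which are FMAs. I constructed the test framework below to explore what other trade-offs may exist. I experimentally confirmed the bounds from the literature on the relative error of both algorithms and on the ulp error of Kahan's algorithm. My experiments indicate that the CHT algorithm provides a smaller ulp error bound of 1.25 ulp, but that it also produces incorrectly rounded results at roughly twice the rate of Kahan's algorithm.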
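The operation counts can be read off by writing each algorithm for ab - cd as a sequence of correctly rounded steps. In this sketch RN(·) denotes rounding to nearest, a notation assumed here rather than taken from the answer, and the steps mirror diff_of_products_kahan and diff_of_products_cht in the listing. The equalities marked exact rely on the rounding error of a product being representable when no underflow occurs.

```latex
\begin{align*}
\textbf{Kahan (4 operations):}\quad
  w &= \mathrm{RN}(cd)                 && \text{multiply}\\
  e &= \mathrm{RN}(w - cd) = w - cd    && \text{FMA, exact}\\
  f &= \mathrm{RN}(ab - w)             && \text{FMA}\\
  \hat{r} &= \mathrm{RN}(f + e)        && \text{add}\\[1ex]
\textbf{CHT (7 operations):}\quad
  p_1 &= \mathrm{RN}(ab), \quad p_2 = \mathrm{RN}(cd)   && \text{2 multiplies}\\
  e_1 &= \mathrm{RN}(ab - p_1) = ab - p_1               && \text{FMA, exact}\\
  e_2 &= \mathrm{RN}(p_2 - cd) = p_2 - cd               && \text{FMA, exact}\\
  r &= \mathrm{RN}(p_1 - p_2), \quad e = \mathrm{RN}(e_1 + e_2) && \text{subtract, add}\\
  \hat{r} &= \mathrm{RN}(r + e)                         && \text{add}
\end{align*}
```

In both cases the exactly recovered product errors are added back before the final rounding, which is what keeps the result accurate when ab and cd nearly cancel.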

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <float.h>
#include <math.h>

#define TEST_SUM  (0)  // function under test. 0: a*b-c*d; 1: a*b+c*d 
#define USE_CHT   (0)  // algorithm. 0: Kahan; 1: Cornea-Harrison-Tang

/*
  Compute a*b-c*d with error <= 1.5 ulp. Maximum relative err = 2**-23

  Claude-Pierre Jeannerod,Nicolas Louvet,and Jean-Michel Muller,"Further Analysis of Kahan's Algorithm for the Accurate Computation 
  of 2x2 Determinants",Mathematics of Computation,Vol. 82,No. 284,Oct. 2013,pp. 2245-2264
*/
float diff_of_products_kahan (float a,float b,float c,float d)
{
    float w = d * c;
    float e = fmaf (c,-d,w);
    float f = fmaf (a,b,-w);
    return f + e;
}

/*
  Compute a*b-c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23

  Claude-Pierre Jeannerod,"A Radix-Independent Error Analysis of the 
  Cornea-Harrison-Tang Method",ACM Transactions on Mathematical Software
  Vol. 42,No. 3,Article 19 (May 2016).
*/
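float diff_of_products_cht (float a, float b, float c, float d)
{
    float p1 = a * b;
    float p2 = c * d;
    float e1 = fmaf (a, b, -p1);  // a*b - p1: rounding error of p1
    float e2 = fmaf (c, -d, p2);  // p2 - c*d: negated rounding error of p2
    float r = p1 - p2;
    float e = e1 + e2;
    return r + e;
}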

/*
  Compute a*b+c*d with error <= 1.5 ulp. Maximum relative err = 2**-23

  Jean-Michel Muller,"On the Error of Computing ab+cd using Cornea,Harrison and Tang's Method",ACM Transactions on Mathematical Software,Vol. 41,No.2,Article 7,(January 2015)
*/
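float sum_of_products_kahan (float a, float b, float c, float d)
{
    float w = c * d;
    float e = fmaf (c, -d, w);  // w - c*d: negated rounding error of w
    float f = fmaf (a, b, w);   // a*b + w
    return f - e;
}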

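/*
  Compute a*b+c*d with error <= 1.25 ulp (?). Maximum relative err = 2**-23

  Claude-Pierre Jeannerod, "A Radix-Independent Error Analysis of the
  Cornea-Harrison-Tang Method", ACM Transactions on Mathematical Software
  Vol. 42, No. 3, Article 19 (May 2016).
*/
float sum_of_products_cht (float a, float b, float c, float d)
{
    float p1 = a * b;
    float p2 = c * d;
    float e1 = fmaf (a, b, -p1);  // a*b - p1: rounding error of p1
    float e2 = fmaf (c, d, -p2);  // c*d - p2: rounding error of p2
    float r = p1 + p2;
    float e = e1 + e2;
    return r + e;
}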

// Fixes via: Greg Rose,KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC  ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579)                     /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)

typedef struct {
    double y;
    double x;
} dbldbl;

dbldbl make_dbldbl (double head,double tail)
{
    dbldbl z;
    z.x = tail;
    z.y = head;
    return z;
}

dbldbl add_dbldbl (dbldbl a,dbldbl b) {
    dbldbl z;
    double t1,t2,t3,t4,t5;
    t1 = a.y + b.y;
    t2 = t1 - a.y;
    t3 = (a.y + (t2 - t1)) + (b.y - t2);
    t4 = a.x + b.x;
    t2 = t4 - a.x;
    t5 = (a.x + (t2 - t4)) + (b.x - t2);
    t3 = t3 + t4;
    t4 = t1 + t3;
    t3 = (t1 - t4) + t3;
    t3 = t3 + t5;
    z.y = t4 + t3;
    z.x = (t4 - z.y) + t3;
    return z;
}

dbldbl sub_dbldbl (dbldbl a,dbldbl b)
{
    dbldbl z;
    double t1, t2, t3, t4, t5;
    t1 = a.y - b.y;
    t2 = t1 - a.y;
    t3 = (a.y + (t2 - t1)) - (b.y + t2);
    t4 = a.x - b.x;
    t2 = t4 - a.x;
    t5 = (a.x + (t2 - t4)) - (b.x + t2);
    t3 = t3 + t4;
    t4 = t1 + t3;
    t3 = (t1 - t4) + t3;
    t3 = t3 + t5;
    z.y = t4 + t3;
    z.x = (t4 - z.y) + t3;
    return z;
}

dbldbl mul_dbldbl (dbldbl a,dbldbl b)
{
    dbldbl t,z;
    t.y = a.y * b.y;
    t.x = fma (a.y,b.y,-t.y);
    t.x = fma (a.x,b.x,t.x);
    t.x = fma (a.y, b.x, t.x);  // cross term: head of a times tail of b
    t.x = fma (a.x, b.y, t.x);  // cross term: tail of a times head of b
    z.y = t.y + t.x;
    z.x = (t.y - z.y) + t.x;
    return z;
}

double prod_diff_ref (float a, float b, float c, float d)
{
    dbldbl t = sub_dbldbl (
        mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
        mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
        );
    return t.x + t.y;
}

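double prod_sum_ref (float a, float b, float c, float d)
{
    dbldbl t = add_dbldbl (
        mul_dbldbl (make_dbldbl ((double)a, 0), make_dbldbl ((double)b, 0)),
        mul_dbldbl (make_dbldbl ((double)c, 0), make_dbldbl ((double)d, 0))
        );
    return t.x + t.y;
}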

float __uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r,&a,sizeof r);
    return r;
}

uint32_t __float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint64_t __double_as_uint64 (double a)
{
    uint64_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

static double floatUlpErr (float res,double ref)
{
    uint64_t i,j,err;
    int expoRef;
    
    /* ulp error cannot be computed if either operand is NaN,infinity,zero */
    if (isnan(res) || isnan (ref) || isinf(res) || isinf (ref) ||
        (res == 0.0f) || (ref == 0.0f)) {
        return 0.0;
    }
    /* Convert the float result to an "extended float". This is like a float
       with 56 instead of 24 effective mantissa bits.
    */
    i = ((uint64_t)__float_as_uint32(res)) << 32;
    /* Convert the double reference to an "extended float". If the reference is
       >= 2^129,we need to clamp to the maximum "extended float". If reference
       is < 2^-126,we need to denormalize because of float's limited exponent
       range.
    */
    expoRef = (int)(((__double_as_uint64(ref) >> 52) & 0x7ff) - 1023);
    if (expoRef >= 129) {
        j = (__double_as_uint64(ref) & 0x8000000000000000ULL) |
            0x7fffffffffffffffULL;
    } else if (expoRef < -126) {
        j = ((__double_as_uint64(ref) << 11) | 0x8000000000000000ULL) >> 8;
        j = j >> (-(expoRef + 126));
        j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
    } else {
        j = ((__double_as_uint64(ref) << 11) & 0x7fffffffffffffffULL) >> 8;
        j = j | ((uint64_t)(expoRef + 127) << 55);
        j = j | (__double_as_uint64(ref) & 0x8000000000000000ULL);
    }
    err = (i < j) ? (j - i) : (i - j);
    return err / 4294967296.0;
}

int main (void)
{
    const float ULMT = sqrtf (FLT_MAX) / 2; // avoid overflow
    const float LLMT = sqrtf (FLT_MIN) * 2; // avoid underflow
    const uint64_t N = 1ULL << 38;
    double ref,ulp,relerr,maxrelerr = 0,maxulp = 0;
    uint64_t count = 0LL,incorrectly_rounded = 0LL;
    uint32_t ai,bi,ci,di;
    float af,bf,cf,df,resf;

#if TEST_SUM
    printf ("testing a*b+c*d ");
#else
    printf ("testing a*b-c*d ");
#endif // TEST_SUM
#if USE_CHT
    printf ("using Cornea-Harrison-Tang algorithm\n");
#else
    printf ("using Kahan algorithm\n");
#endif

    do {
        do {
            ai = KISS;
            af = __uint32_as_float (ai);
        } while (!isfinite(af) || (fabsf (af) > ULMT) || (fabsf (af) < LLMT));
        do {
            bi = KISS;
            bf = __uint32_as_float (bi);
        } while (!isfinite(bf) || (fabsf (bf) > ULMT) || (fabsf (bf) < LLMT));
        do {
            ci = KISS;
            cf = __uint32_as_float (ci);
        } while (!isfinite(cf) || (fabsf (cf) > ULMT) || (fabsf (cf) < LLMT));
        do {
            di = KISS;
            df = __uint32_as_float (di);
        } while (!isfinite(df) || (fabsf (df) > ULMT) || (fabsf (df) < LLMT));
        count++;
#if TEST_SUM        
#if USE_CHT
        resf = sum_of_products_cht (af, bf, cf, df);
#else // USE_CHT
        resf = sum_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
        ref = prod_sum_ref (af, bf, cf, df);
#else // TEST_SUM
#if USE_CHT
        resf = diff_of_products_cht (af, bf, cf, df);
#else // USE_CHT
        resf = diff_of_products_kahan (af, bf, cf, df);
#endif // USE_CHT
        ref = prod_diff_ref (af, bf, cf, df);
#endif // TEST_SUM
        ulp = floatUlpErr (resf,ref);
        incorrectly_rounded += ulp > 0.5;
        relerr = fabs ((resf - ref) / ref);
        if ((ulp > maxulp) || ((ulp == maxulp) && (relerr > maxrelerr))) {
            maxulp = ulp;
            maxrelerr = relerr;
            printf ("%13llu %12llu ulp=%.9f a=% 15.8e b=% 15.8e c=% 15.8e d=% 15.8e res=% 16.6a ref=% 23.13a relerr=%13.9e\n",
                    (unsigned long long)count, (unsigned long long)incorrectly_rounded,
                    ulp, af, bf, cf, df, resf, ref, relerr);
        }
    } while (count <= N);

    return EXIT_SUCCESS;
}
