Strange bug in transmission of float values over TCP/IP


axilmar

Member #1,204

April 2001

Hello all.

I have an extremely strange bug.

I have two applications that communicate over TCP/IP.

Application A is the server, and application B is the client.

Application A sends a bunch of float values to application B every 100 milliseconds.

The bug is the following: sometimes some of the float values received by application B are not the same as the values transmitted by application A.

Initially, I thought there was a problem with the Ethernet or TCP/IP drivers (some sort of data corruption). I then tested the code on other Windows machines, but the problem persisted.

I then tested the code on Linux (Ubuntu 10.04.1 LTS) and the problem is still there!!!

The values are logged just before they are sent and just after they are received.

The code is pretty straightforward: the message protocol has a 4 byte header like this:

//message header
struct MESSAGE_HEADER {
    unsigned short type;
    unsigned short length;
};

//orientation message
struct ORIENTATION_MESSAGE : MESSAGE_HEADER
{
    float azimuth;
    float elevation;
    float speed_az;
    float speed_elev;
};

//any message
struct MESSAGE : MESSAGE_HEADER {
    char buffer[512];
};

//receive specific size of bytes from the socket
static int receive(SOCKET socket, void *buffer, size_t size) {
    int r;
    do {
        r = recv(socket, (char *)buffer, size, 0);
        if (r == 0 || r == SOCKET_ERROR) break;
        buffer = (char *)buffer + r;
        size -= r;
    } while (size);
    return r;
}

//send specific size of bytes to a socket
static int send(SOCKET socket, const void *buffer, size_t size) {
    int r;
    do {
        r = send(socket, (const char *)buffer, size, 0);
        if (r == 0 || r == SOCKET_ERROR) break;
        buffer = (char *)buffer + r;
        size -= r;
    } while (size);
    return r;
}

//get message from socket
static bool receive(SOCKET socket, MESSAGE &msg) {
    int r = receive(socket, &msg, sizeof(MESSAGE_HEADER));
    if (r == SOCKET_ERROR || r == 0) return false;
    if (ntohs(msg.length) == 0) return true;
    r = receive(socket, msg.buffer, ntohs(msg.length));
    if (r == SOCKET_ERROR || r == 0) return false;
    return true;
}

//send message
static bool send(SOCKET socket, const MESSAGE &msg) {
    int r = send(socket, &msg, ntohs(msg.length) + sizeof(MESSAGE_HEADER));
    if (r == SOCKET_ERROR || r == 0) return false;
    return true;
}

When I receive the message 'orientation', sometimes the 'azimuth' value is different from the one sent by the server!

Shouldn't the data be the same all the time? Doesn't TCP/IP guarantee that the data is delivered uncorrupted? Could an exception in the math co-processor affect the TCP/IP stack? Is it a problem that I receive a small number of bytes first (the 4-byte header) and then the message body?

ALGUI: c++11 A5 GUI library.

Thomas Fjellstrom

Member #476

June 2000

First thing I'd suggest is to not send the struct raw like that. Serialize it properly and send it out. By default, structs can have a fair amount of padding between elements, depending on types and order. Different compilers, and different versions of the same compiler, may also align elements differently. So it's best not to send structs directly.
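A minimal sketch of what serializing field by field could look like, assuming 32-bit IEEE 754 floats and a big-endian wire format. The helper names (`put_u16_be`, `put_u32_be`, `put_f32_be`, `serialize_orientation`) are hypothetical and not part of the code posted above; the point is only that each field is written at an explicit offset, so struct padding never reaches the wire.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: write a 16-bit value in big-endian order.
static std::size_t put_u16_be(unsigned char *buf, std::uint16_t v) {
    buf[0] = static_cast<unsigned char>(v >> 8);
    buf[1] = static_cast<unsigned char>(v & 0xFF);
    return 2;
}

// Hypothetical helper: write a 32-bit value in big-endian order.
static std::size_t put_u32_be(unsigned char *buf, std::uint32_t v) {
    buf[0] = static_cast<unsigned char>((v >> 24) & 0xFF);
    buf[1] = static_cast<unsigned char>((v >> 16) & 0xFF);
    buf[2] = static_cast<unsigned char>((v >> 8) & 0xFF);
    buf[3] = static_cast<unsigned char>(v & 0xFF);
    return 4;
}

// Copy the float's bits into an integer first, then write the integer.
// Assumption: float is a 32-bit IEEE 754 value on this platform.
static std::size_t put_f32_be(unsigned char *buf, float f) {
    static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return put_u32_be(buf, bits);
}

// Writes a 4-byte header plus four floats; 'out' must hold at least 20 bytes.
// Returns the number of bytes written.
static std::size_t serialize_orientation(unsigned char *out, std::uint16_t type,
                                         float azimuth, float elevation,
                                         float speed_az, float speed_elev) {
    std::size_t n = 0;
    n += put_u16_be(out + n, type);
    n += put_u16_be(out + n, 16);  // payload length: four 4-byte floats
    n += put_f32_be(out + n, azimuth);
    n += put_f32_be(out + n, elevation);
    n += put_f32_be(out + n, speed_az);
    n += put_f32_be(out + n, speed_elev);
    return n;
}
```

The resulting buffer can be handed to the existing byte-oriented send loop as is; the receiver would read the fields back in the same order.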

--
Thomas Fjellstrom - [website] - [email] - [Allegro Wiki] - [Allegro TODO]
"If you can't think of a better solution, don't try to make a better solution." -- weapon_S
"The less evidence we have for what we believe is certain, the more violently we defend beliefs against those who don't agree" -- https://twitter.com/neiltyson/status/592870205409353730

Oscar Giner

Member #2,207

April 2002

Are the server and client compiled with the same compiler and version, and with the same compile flags (ones that affect how certain floating point operations are executed)? And even then, running each one on a different CPU may lead to slightly different results (IEEE specifies floating point representation, but not operations on them, so a simple 32 bit float -> 80 bit float (as the x86 FPU operates with 80 bit floats) conversion may yield different results between CPUs).

So don't use floats with network applications (or any application where different computers must return exactly the same value). Floating point is not designed for 100% accurate operations.

--
[Website | e-mail]
[Tetris Unlimited] [AllegAVI | AlText]

kazzmir

Member #1,786

December 2001

In my game I shift all floating point values to the left of the decimal by 2 and send an integer, then shift it back on the receiving side.

int to_send = (int)(some_float * 100);
...
float received = recv() / 100.0;
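A slightly fuller sketch of the same fixed-point idea, with hypothetical names (`kScale`, `to_fixed`, `from_fixed`). It rounds to nearest instead of truncating and uses a fixed-width integer. Note the assumptions: only two decimal places survive the trip, and the scaled value must fit in 32 bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical scale factor: two decimal places, as in the post above.
constexpr double kScale = 100.0;

// Convert a float to a scaled integer, rounding to the nearest step.
static std::int32_t to_fixed(float value) {
    return static_cast<std::int32_t>(std::lround(value * kScale));
}

// Convert the scaled integer back to a float on the receiving side.
static float from_fixed(std::int32_t wire) {
    return static_cast<float>(wire / kScale);
}
```

The integer itself still has to be put into a known byte order (for example with htonl) before it goes out on the socket.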

SiegeLord

Member #7,827

October 2006

Hmm... I'm a tad confused. When you send a float like that (via fixed point) you're relying on the integer to be precisely transported over the network. What is the difference between that and storing the float's bit pattern inside the int with full precision? After all, float bit patterns are well standardized.

All the other CPU differences are endemic to using floats period, and have nothing to do with transporting them over a network.

"For in much wisdom is much grief: and he that increases knowledge increases sorrow."-Ecclesiastes 1:18
[SiegeLord's Abode][Codes]:[DAllegro5]:[RustAllegro]

bamccaig

Member #7,536

July 2006

I don't think that you can rely on all machines representing the floating point number in exactly the same bit pattern... It's probably best to send character data and parse it.

--
I mean the best with what I say. It doesn't always sound that way.

GullRaDriel

Member #3,861

September 2003

Use double. Float aren't normalized as much as double.

I used to need the same thing as you, and I ran into the same problem. My tests showed it working with double because it's IEEEEEEEEEEEEEE I don't know what.

"Code is like shit - it only smells if it is not yours"
Allegro Wiki, full of examples and articles !!

SiegeLord

Member #7,827

October 2006

bamccaig said:

I don't think that you can rely on all machines representing the floating point number in exactly the same bit pattern... It's probably best to send character data and parse it.

I'd need evidence to prove that to me. IEEE 754 standard strictly defines the bit patterns of valid floats (I think it gives leeway for NaN's). I can't think of any system in wide use that does not implement IEEE 754.


kazzmir

Member #1,786

December 2001

You may be right that float/double's can be sent over the network to arbitrary CPU's but I do not know if this is strictly true so I played it safe and just used integers.

SiegeLord

Member #7,827

October 2006

int a = 0;
a = 0; // Set it again, just to be sure


Billybob

Member #3,136

January 2003

SiegeLord said:

int a = 0;
a = 0; // Set it again, just to be sure

QFT.

My bet is on struct padding.

Evert

Member #794

November 2000

bamccaig said:

I don't think that you can rely on all machines representing the floating point number in exactly the same bit pattern...

Sure you can. If you know they're using IEEE floats and the processors in question use the same endianness.

GullRaDriel said:

Float aren't normalized as much as double.

Bollocks. Single precision floats are just as standard as double precision floats. Most likely, they're both encoded using IEEE 754, and if one of them isn't, neither is the other one.

Transmitting floats in binary over a network is no different and no less portable than dumping floats to a file in binary (which is certainly something you in principle do want to be careful about because there are computers out there that store floats differently than a consumer PC does).
What is not very portable is relying on the layout of a struct to be the same from one compilation to another.

axilmar

Member #1,204

April 2001

Thomas Fjellstrom said:

True, but the code uses packing of 1, so that is not the problem.

Furthermore, if it was, the problem would be immediately visible.

Oscar Giner said:

Are the server and client compiled with the same compiler and version, and with the same compile flags (ones that affect how certain floating point operations are executed)?

Yes.

Quote:

And even then, running each one on a different CPU may lead to slightly different results (IEEE specifies floating point representation, but not operations on them, so a simple 32 bit float -> 80 bit float (as the x86 FPU operates with 80 bit floats) conversion may yield different results between CPUs).

True, but can this account for the big differences in value? For example, a value of 0.780193 in the server becomes 0.790193 in the client. Can the value difference be 0.010?

Quote:

So don't use floats with network applications (or any application where different computers must return exactly the same value). Floating point is not designed for 100% accurate operations.

Unfortunately, I have to use floats because it's in the specification protocol given by the client.

Evert said:

Most likely, they're both encoded using IEEE 754

True. The protocol specifies IEEE 754 floats.


Arthur Kalliokoski

Second in Command

February 2005

axilmar said:

a value of 0.780193 in the server becomes 0.790193 in the client.

Pics or it didn't happen :X

They all watch too much MSNBC... they get ideas.

Evert

Member #794

November 2000

axilmar said:

for example, a value of 0.780193 in the server becomes 0.790193 in the client. Can the value difference be 0.010?

No.^[1]
Remember, you're not doing calculations here, just sending numbers across.

You can do the following experiment: read the float value in as 32-bit integer and examine the bit pattern. These should be identical. If they are, and yet the float values are different... well, I'm not sure what to suggest, except to interpret the float explicitly and reconstruct its value "by hand". If they're not the same, there's a bug somewhere.
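A small sketch of that experiment, assuming a 32-bit float; `float_bits` and `log_float` are hypothetical names. Logging the hex pattern next to the decimal value on both sides makes it obvious whether the bits themselves changed in transit.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Return the raw 32-bit pattern of a float without any numeric conversion.
static std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Hypothetical logging helper: print the value and its bit pattern together.
static void log_float(const char *label, float f) {
    std::printf("%s: %.9g (0x%08lX)\n", label, f,
                static_cast<unsigned long>(float_bits(f)));
}
```

Calling `log_float("sent", value)` just before sending and `log_float("received", value)` just after receiving gives two lines that can be compared directly.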

[1] Yes, in a calculation, you can if you're not careful - especially with single precision.

Arthur Kalliokoski

Second in Command

February 2005

axilmar said:

0.780193 in the server becomes 0.790193 in the client.

The hex representations are

0.780193 = 0x3F47BABA
0.790193 = 0x3F4A4A17

It's quite a remarkable coincidence to alter the pattern to the second pattern at random.


axilmar

Member #1,204

April 2001

Arthur Kalliokoski said:

Pics or it didn't happen

I've attached a pic of the problem. The server transmits the value -0.830673, and the client receives the value -0.831650. The pic is from Excel; the columns are set to type 'number' with 6 decimal digits.

It's not a rounding issue with Excel, because the Excel data come from .csv files produced by logging the data directly in the client and server, and the same values exist in the .csv files.

Evert said:

Good idea. I am also going to use Wireshark to see what data are actually transmitted over the network.

EDIT:

I added another picture that shows the transmitted/received bytes at server/client. There is a difference between the bytes transmitted and the bytes received.


Tobias Dammers

Member #2,604

August 2002

If the bytes sent differ from the bytes received, then the only thing I can think of is a firewall or router between client and server that somehow misinterprets the bytes; maybe something along the way is trying to convert between character encodings.

---
Me make music: Triofobie
---
"We need Tobias and his awesome trombone, too." - Johan Halmén

GullRaDriel

Member #3,861

September 2003

Tobias Dammers said:

maybe something along the way is trying to convert between character encodings.

I wouldn't expect a router to do that much. TCP guarantees that the receiver gets exactly the bytes you handed over at the sending end. Doing any kind of conversion along the way would break that guarantee.

axilmar said:

There is a difference between the bytes transmitted and the bytes received.

There lies your problem. The data is filled with garbage at the end point.

Edit: Are you checking the return values of your functions? Are you sure it's not the MESSAGE_HEADER handling that is broken and causing you to receive the wrong amount of data?

Edit2:

Thomas said:

Quoted for truth!!! I didn't notice it before, but you must not send structs directly over the network. The byte order of each computer can be different. Serialize.

My own send and recv work like this:

-htonl of type
-send type
-htonl of size
-send size
-send buffer whose length is size

-recv type
-type = ntohl type
-recv size
-size = ntohl size
-recv buffer whose length is size
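The steps above could be sketched as follows, using an in-memory byte vector in place of a socket so the framing logic can be checked on its own. `encode_frame` and `decode_frame` are hypothetical names, and the manual shifts are assumed to stand in for htonl/ntohl (both produce big-endian order on the wire).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 32-bit value in network (big-endian) byte order.
static void put_u32_be(std::vector<unsigned char> &out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<unsigned char>((v >> shift) & 0xFF));
}

// Read a 32-bit big-endian value from four bytes.
static std::uint32_t get_u32_be(const unsigned char *p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
           static_cast<std::uint32_t>(p[3]);
}

// Sender side: type, then size, then the payload bytes.
static std::vector<unsigned char> encode_frame(
        std::uint32_t type, const std::vector<unsigned char> &payload) {
    std::vector<unsigned char> out;
    put_u32_be(out, type);
    put_u32_be(out, static_cast<std::uint32_t>(payload.size()));
    out.insert(out.end(), payload.begin(), payload.end());
    return out;
}

// Receiver side: returns true only if 'data' holds one complete frame.
static bool decode_frame(const std::vector<unsigned char> &data,
                         std::uint32_t &type,
                         std::vector<unsigned char> &payload) {
    if (data.size() < 8) return false;               // header incomplete
    type = get_u32_be(data.data());
    const std::uint32_t size = get_u32_be(data.data() + 4);
    if (data.size() - 8 < size) return false;        // body incomplete
    payload.assign(data.begin() + 8, data.begin() + 8 + size);
    return true;
}
```

With a real socket, the receiver would keep reading until it has the 8 header bytes, then keep reading until it has `size` more, since a single recv call may return fewer bytes than requested.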


Billybob

Member #3,136

January 2003

You never show how a MESSAGE is constructed before sending, or re-constructed after receiving.

Evert

Member #794

November 2000

I'd suggest checking for parity bits, but it's a bit odd if that only affects the one number.
Anyway, if the bit patterns are different, then your problem has nothing to do with floats per se and the same problem would/should show up with integer data. Or any data really.

You do have access to both the server code and the client code? Does the problem persist if you send the data over a local socket?

axilmar

Member #1,204

April 2001

I think I found the problem. The endianness swapping routine does not work for floats.

If this code is run:

#include <cstdio>
#include <iostream>
using namespace std;

//swap the byte order of a float
float ntohf(float f)
{
  float r;
  unsigned char *s = (unsigned char *)&f;
  unsigned char *d = (unsigned char *)&r;
  d[0] = s[3];
  d[1] = s[2];
  d[2] = s[1];
  d[3] = s[0];
  return r;
}

int main() {
  unsigned long l = 3206974079;
  float f1 = (float &)l;
  //swapping twice should give back the original bytes
  float f2 = ntohf(ntohf(f1));
  unsigned char *c1 = (unsigned char *)&f1;
  unsigned char *c2 = (unsigned char *)&f2;
  printf("%02X %02X %02X %02X\n", c1[0], c1[1], c1[2], c1[3]);
  printf("%02X %02X %02X %02X\n", c2[0], c2[1], c2[2], c2[3]);
  getchar();
  return 0;
}

It outputs the following:

7F 8A 26 BF
7F CA 26 BF

The two lines should be identical, but they are not.

Does anyone have an idea why this is happening?


Thomas Fjellstrom

Member #476

June 2000

TCP already has parity bits. If something went wrong the packet never would have made it. Unless the receiver corrupted it after the TCP stack was done with it.

Evert

Member #794

November 2000

axilmar said:

Does anyone have an idea why this is happening?

Unless there is something peculiar about C++ I don't know about, the following C program should be identical:

#include <stdio.h>

float ntohf(float f)
{
  float r;
  unsigned char *s = (unsigned char *)&f;
  unsigned char *d = (unsigned char *)&r;
  d[0] = s[3];
  d[1] = s[2];
  d[2] = s[1];
  d[3] = s[0];
  return r;
}

int main() {
  unsigned long l = 3206974079;
  float f1 = *((float *)&l);
  float f2 = ntohf(ntohf(f1));
  unsigned char *c1 = (unsigned char *)&f1;
  unsigned char *c2 = (unsigned char *)&f2;
  printf("%02X %02X %02X %02X\n", c1[0], c1[1], c1[2], c1[3]);
  printf("%02X %02X %02X %02X\n", c2[0], c2[1], c2[2], c2[3]);
  return 0;
}

(Yes, I basically copied your code; I'd use a union instead of those fugly casts). This gives

7F 8A 26 BF
7F 8A 26 BF

as expected. So I would say that there is either a problem with your compiler, or your hardware...

Thomas Fjellstrom said:

TCP already has parity bits. If something went wrong the packet never would have made it.

I was thinking something could have stuck an extra layer of parity bits in there. Well, actually, I didn't actually think that, but it was one of the only things I could think of that would give you different numbers on the sender and the receiver.

axilmar

Member #1,204

April 2001

I actually went through the program in assembly... the function ntohf, instead of returning the float through a 32-bit register, pushed the value to the floating point stack of the co-processor.

The floating point stack accepts 80-bit values, and therefore the float value expanded from 32 to 80 bits.

When the caller read the value, the value was extracted from the floating point stack, and converted from 80 bits to 32 bits. This caused a rounding problem.

EDIT:

The rounding problem was magnified because it happened on the swapped float, not on the original one.
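One way to keep a byte-swapped value out of the floating point registers entirely is to do the swap on a 32-bit integer and convert to float only once the bits are in host order. The sketch below assumes a 32-bit IEEE 754 float and a host whose byte order is the opposite of the wire order; `swap32`, `float_from_swapped_bits` and `swapped_bits_from_float` are hypothetical names. It is also worth noting that in the test program above the swapped pattern is 0x7F8A26BF, which has all exponent bits set and a non-zero mantissa, so it encodes a NaN, and the two printed lines differ only in bit 22 (8A versus CA), the bit that marks a NaN as quiet. So the change may be the FPU quieting a signaling NaN during the load and store rather than ordinary rounding.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reverse the byte order of a 32-bit integer. No FPU involved.
static std::uint32_t swap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Receiving side: swap the raw bits first, then reinterpret them as a float.
// The float only ever holds a value that is already in host order.
static float float_from_swapped_bits(std::uint32_t wire_bits) {
    const std::uint32_t host_bits = swap32(wire_bits);
    float f;
    std::memcpy(&f, &host_bits, sizeof f);
    return f;
}

// Sending side: take the float's bits as an integer, then swap the integer.
static std::uint32_t swapped_bits_from_float(float f) {
    std::uint32_t host_bits;
    std::memcpy(&host_bits, &f, sizeof host_bits);
    return swap32(host_bits);
}
```

With this split, the message struct would carry `std::uint32_t` fields on the wire, and the conversion to float would happen after the swap.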
