I posted to this forum about 1 year ago concerning the receipt of the autorecovery signal from our LabJack T7 during streaming operation.
https://labjack.com/forums/t7/t7-streaming-digital-inputs-occasionally-r...
We modified our code to handle this and have been fine until December 11 when at one installation we received an LJME_INCORRECT_NUM_RESPONSE_BYTES_RECEIVED error. Since that time, we have been having frequent issues with the operation of the LabJack subsystem.
Some additional background. The system has been operational since 2015. In October 2018, the system was damaged, likely by an electrical transient event, and a number of components, including the original LabJack T7, were replaced. The T7 was replaced by a spare that was purchased at the same time as the original. We then purchased new T7 units to replace the spares. The system operated fine for the 2 months between the damage event and the first error code. In late December we replaced the T7 with the new spare, but that did not resolve the issue.
Our system is configured to stream the 16 digital inputs on the EIO and FIO ports at 100K samples per second. We also use the LabJack to periodically sample two additional digital inputs at a 1 second rate, and occasionally control two digital outputs.
When one of these error events occurs, the first error message is one of the following:
- INCORRECT_NUM_RESPONSE_BYTES_RECEIVED
- LJME_MBE1_ILLEGAL_FUNCTION
- STREAM_AUTO_RECOVER_END_OVERFLOW
We are running your driver installed using the LabJack-2015-06-03.exe installer. We attempted to update the driver using the LabJack-2018-08-30.exe installer, but found frequent autorecovery signals (all 1's) in the received data that we do not get with the older driver, so we went back to our original driver.
We interface with the T7 using Ethernet through a network switch. We start the stream using the eStreamStart command, with 100,000 scans per read, 1 address, and FIO_EIO_STATE as the address. The periodically sampled inputs are CIO0 and CIO1. The occasionally controlled output pins are MIO0 and MIO1.
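In LJM C terms, the stream setup is essentially the following (a simplified sketch with error checking omitted, not our production code):

```c
#include <stdio.h>
#include <LabJackM.h>

int main(void)
{
    int handle, address, type;
    int aScanList[1];
    double scanRate = 100000;           /* 100,000 samples per second */
    const int SCANS_PER_READ = 100000;  /* one second of data per read */

    LJM_OpenS("T7", "ETHERNET", "ANY", &handle);

    /* Resolve the single stream channel by name. */
    LJM_NameToAddress("FIO_EIO_STATE", &address, &type);
    aScanList[0] = address;

    LJM_eStreamStart(handle, SCANS_PER_READ, 1, aScanList, &scanRate);
    printf("Streaming at %.0f scans/second\n", scanRate);

    /* ... read loop (see the sketch below) ... */

    LJM_eStreamStop(handle);
    LJM_Close(handle);
    return 0;
}
```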
We have other systems up and running that do not exhibit this behavior. We have not found anything that changed between the replacement of the damaged parts and the first error message. The system is 1000 miles away from us; we have remote access, but any physical work requires the customer to perform it.
We ran Wireshark on the system to see what was going on between the T7 and the PC. In this one case we found that instead of transmitting a new packet 5 ms after the previous one, the T7 retransmitted the previous 2 packets again after almost a full second of delay (see attachment Wireshark capture 20180107_202356 Stream Spurious Retransmission.PNG). The T7 then streams the packets out without any delays (less than one millisecond between packets) until it catches up. Shortly after catching up, the PC transmitted a Modbus command writing the value 0 to the STREAM_ENABLE register, thereby stopping the stream (see attachment Wireshark capture 20180107_202356 Stream Disable Message.PNG). The driver then returns the STREAM_AUTO_RECOVER_END_OVERFLOW error in response to our call to eStreamRead. This causes our software to reopen the connection and restart the stream. Our code runs in a loop that has no apparent means of generating the stop-stream message, so we're wondering if there is anything in the driver that could do it, and under what conditions it would.
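A simplified sketch of that loop (restart_stream() is a hypothetical stand-in for our reopen-and-restart logic):

```c
#include <LabJackM.h>

extern void restart_stream(int *handle);  /* hypothetical reopen + eStreamStart helper */

void acquisition_loop(int handle, int scansPerRead, double *aData)
{
    int devBacklog = 0, ljmBacklog = 0;
    for (;;) {
        /* aData must hold scansPerRead values (1 channel). */
        int err = LJM_eStreamRead(handle, aData, &devBacklog, &ljmBacklog);
        if (err != LJME_NOERROR) {
            /* e.g. STREAM_AUTO_RECOVER_END_OVERFLOW: we reopen the
               connection and restart the stream. Nothing here writes
               to STREAM_ENABLE. */
            restart_stream(&handle);
            continue;
        }
        /* ... scan aData for trigger events ... */
    }
}
```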
Over time we do occasionally hear about machines where the network does not always operate as expected, and there have been driver and firmware changes to handle these things when we can identify them. So my first suggestion is to use the latest installer, LabJack-2018-08-30.exe, and then use Kipling to update to firmware 1.0270. It makes sense to be running these before we start digging into the traffic capture.
If you still see problems, try running LJStreamM.exe rather than your software. You can't do the command-response reads or output control, but can at least see if the basic stream of 1 channel at 100k has any issues.
If you still see problems, please try running this installer for a development version of LJM (soon to be released as LJM 1.1901) and then try again:
http://files.labjack.com/temporary/LabJackMUpdate-1.1900-dev.exe
Note that LJM will send the stream disable packet when it encounters an unrecoverable error. STREAM_AUTO_RECOVER_END_OVERFLOW is one such error.
In your Wireshark capture 20180107_202356 Stream Spurious Retransmission.PNG, I see 0B7F, indicating error 2943, STREAM_AUTO_RECOVER_END_OVERFLOW. This means autorecovery lasted so long that the device could not count the number of missing samples.
You can try the techniques here to reduce the occurrence of auto recovery:
https://labjack.com/support/software/api/ljm/streaming-lots-of-9999-values
You could also use a direct Ethernet connection, which eliminates much of the network latency:
https://labjack.com/support/faq/how-do-i-connect-t7-directly-my-computer...
For the sake of inserting missing scans where they belong, autorecovery currently relies on the first channel of stream being an analog input:
http://labjack.com/support/datasheets/t-series/communication/stream-mode...
However, since your first stream channel will never return all 1's, we can release a version of LJM that will allow your first channel to work with autorecovery.
I will update this thread when that version of LJM is released.
Thanks for the quick feedback.
We are now getting the new driver running on our test setup before deploying it at the customer location. I'll report how that goes once we are done.
I have two new questions based upon your responses.
1. The pattern I see for the stream data is the T7 transmitting 2 packets, followed by the PC transmitting an ACK. Rinse, repeat. What would happen in the T7 if it didn't get the ACK? From the Wireshark capture 20180107_202356 Stream Spurious Retransmission.PNG, it appears that it would stop transmitting until either it gets the ACK or a TCP timeout event causes the packets to be retransmitted, which would occur approximately 1 second after the second packet was sent. Is this correct? If so, is it possible to change this timeout value?
2. Regarding your response about the autorecovery signaling: in my Wireshark capture 20180107_202356 Stream Spurious Retransmission.PNG, I see the first data sample as 0xFF 0xFF, which your documentation states is the autorecovery signal. LJM passes this on to our application and we toss it. Is your change to LJM to duplicate this signal n times, where n is the number of missed samples? Where in the data packet would that number be transmitted to LJM (bytes 14 & 15?)?
1. I'll investigate this.
2. The 0xFFFF is actually the "number of skipped scans", except that since autorecovery overflowed, it's more than that. You're right that the change would be to duplicate the signal n times, where n is the number of missed samples. The number of missing samples is the two bytes after a 0x0B7D status.
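For illustration, here is a sketch of pulling those fields out of a raw stream packet, using the byte offsets discussed in this thread (backlog at bytes 10-11, status at bytes 12-13, skipped-scan count at bytes 14-15); see the T-series low-level streaming documentation for the authoritative layout:

```c
#include <stdint.h>
#include <stdio.h>

/* Inspect the header fields of a raw T7 stream packet, per the
   offsets discussed in this thread. */
void inspect_stream_packet(const uint8_t *pkt)
{
    uint16_t backlogBytes = (pkt[10] << 8) | pkt[11];  /* device buffer status */
    uint16_t status       = (pkt[12] << 8) | pkt[13];

    printf("device backlog: %u bytes, status: 0x%04X\n", backlogBytes, status);

    if (status == 0x0B7D) {          /* 2941: auto-recovery ended */
        uint16_t skippedScans = (pkt[14] << 8) | pkt[15];
        printf("auto-recovery ended; %u scans skipped\n", skippedScans);
    }
    else if (status == 0x0B7F) {     /* 2943: auto-recovery end overflow */
        printf("auto-recovery overflowed; skipped count is unreliable\n");
    }
}
```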
Per your suggestion in message #2, I have upgraded our internal test system as shown below:
Hardware:
- Manufacturer: Axiomtek
- Processor: Atom CPU E3845
- Clock: 1.91 GHz
- Installed RAM: 4.00 GB
- Usable RAM: 2.89 GB
- OS: Windows 8.1 Pro
- OS bits: 32-bit OS / 64-bit processor
- Version: 6.3 (Build 9600)
- Kipling: 3.1.14

LabJack Device:
- Model: T7
- Serial #: 470012192
- FW Version: 1.0255
- Bootloader Version: 0.94
- Recovery FW Version: 0.6602
- Connection: Ethernet
I held off upgrading the board to the Beta firmware 1.0270 for the moment to get a baseline using your current stable release.
I ran your LJStreamM (V1.06) configured with 1 channel, FIO_EIO_STATE, at a 100,000 Hz scan rate. Shortly after hitting Start Stream, I started a Wireshark capture with no capture filter. The stream ended with LabJack Error #1301: LJME_LJM_BUFFER_FULL, as shown in the LJStreamM screen capture attachment. I'm not convinced the termination was due to a problem rather than simply hitting a limit of 50,000 scans read, but I defer to you here.
In Wireshark the stream is looking good until we receive the stream stop message from the driver, 500 µs after it acknowledges the previous 2 packets.
Does this provide any additional insight?
One additional question. In the screen capture, the # of DeviceScanBacklog is 19.00. I'm assuming this is the backlog bytes. In Wireshark, bytes 10-11 of the stream messages consistently contain either 0x0028 or 0x0029. Why are these values different, and am I misinterpreting these bytes?
Also, getting back to post #6, any updates on why the card would stop sending streaming data?
Thanks again for your help in this matter.
Using the same system as in my previous post, I closed LJStreamM and started our service. With this LabJack driver we are no longer seeing error messages, but we see the "All 1" pattern in our data. In the attached Wireshark capture (TestSystem_FW-1.0255_LJStream-20180830_Our_Application.PNG) we see the ACK messages from the PC returning smaller and smaller window values, until the value is less than the 2080 data bytes we normally transmit (line 270: Win=1824). The next stream packets from the T7 transfer 1040 and 784 bytes, as allowed by the driver. Then we see the exchange of ACK packets in which the PC indicates it has a receive window size of 0. Once this clears and the T7 gets an ACK with a large enough window size, the T7 transmits a packet with the balance of the previous transfer and the next Modbus stream packet with 1040 bytes. If the delay was long enough, the T7 starts transmitting the backlog bytes as fast as it can until it catches up. In this sample we have a distribution of times between these events from 100 ms to over 5 seconds (longer is better), which matches the distribution of the "All 1" pattern in our data. But nowhere in the Wireshark data do I see the autorecover flag, so I'm wondering if the driver is inserting the "All 1" pattern into the data in response to these events? And what does it do when these occur?
If we are losing data using the current system in the field with the new firmware and driver, is there a way for us to go backwards to what we had working? The other field systems are running firmware version 1.0146, with the driver from the LabJack-2015-06-03.exe installer. At the problem location we did install the new driver and then uninstall it and reinstall the 2015 driver, but we aren't confident that the process truly took us all the way back to the original state (i.e. Windows driver, registry values, and other pieces of flotsam and jetsam these processes leave lying around).
In regards to changing the timeout: the firmware engineer reports that it sounds infeasible. Changing it to a value short enough to avoid auto-recovery would probably interfere with normal TCP operation.
Bytes 10-11 indicate the device buffer status in bytes.
When LJStreamM is running, how are the backlogs acting, for both LJMScanBacklog and DeviceScanBacklog? They should both be staying at a fairly low number.
For example, on my machine I see that LJMScanBacklog stays between 0 and 500. If I resize the LJStreamM window, I notice that the LJMScanBacklog spikes to over a thousand or over ten thousand. I attribute this to the thread that calls LJM_eStreamRead not getting enough processor time. Since the LJMScanBacklog can hold up to 20 seconds worth of data by default, LJME_LJM_BUFFER_FULL essentially means LJM_eStreamRead has not been called fast enough for 20 seconds worth of time.
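For reference, that 20-second limit can be read or raised through LJM's library configuration; a minimal sketch using the LJM_STREAM_BUFFER_MAX_NUM_SECONDS config (set it before starting the stream):

```c
#include <stdio.h>
#include <LabJackM.h>

int main(void)
{
    double maxSeconds = 0;

    /* Read the current limit (20 seconds of scans by default). */
    LJM_ReadLibraryConfigS("LJM_STREAM_BUFFER_MAX_NUM_SECONDS", &maxSeconds);
    printf("Stream buffer limit: %.0f seconds\n", maxSeconds);

    /* Allow more headroom before LJME_LJM_BUFFER_FULL is returned. */
    LJM_WriteLibraryConfigS("LJM_STREAM_BUFFER_MAX_NUM_SECONDS", 40);
    return 0;
}
```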
For DeviceScanBacklog, it stays around 20 or 21 for a wired connection to my laptop. When I switch to a wireless connection on my laptop (still Ethernet to the T7), DeviceScanBacklog jumps around a lot, and I sometimes get a STREAM_AUTO_RECOVER_END_OVERFLOW. This makes sense because Wi-Fi is slower and much more likely to lose packets. Ethernet, on the other hand, only loses packets due to network congestion, as far as I am aware. Is it possible that your network is suffering from congestion or Wi-Fi links? You could test with a direct Ethernet connection (same link as above) to eliminate the network problems. If that works, you could test various components of your network until the problem is found.
I will respond to your other questions on Monday.
In post #8, you say you get all 1s without the stream autorecovery end flag. That doesn't make sense to me. Are you searching the whole capture for the bytes "0B7D"?
In regards to rolling back, yes, you should be able to simply run whichever installer you need. Each installer installs its own uninstaller, and when you run an installer, it runs the previously installed uninstaller. So you should be back to how it was. Why are you questioning whether everything was rolled back? In regards to LJM, registry values don't matter. You can check LJM's version by calling LJM_ReadLibraryConfigS(LJM_LIBRARY_VERSION, ...). You could also check that the constants file has a reasonable 2015 date version.
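For example, a minimal version check:

```c
#include <stdio.h>
#include <LabJackM.h>

int main(void)
{
    double version = 0;
    if (LJM_ReadLibraryConfigS(LJM_LIBRARY_VERSION, &version) == LJME_NOERROR)
        printf("Installed LJM version: %.4f\n", version);
    return 0;
}
```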
I do recommend updating fully to the latest software and firmware versions though, since I know some intermittent errors have been fixed.
It sounds like you have two possible root causes of the all-1s issue you're seeing. One is network congestion (causing a lost ACK) and the other is a zero-window issue.
1. Network congestion: To avoid this, you can test your network part by part to find where the problem emerges. For example, if you can stream from the T7 directly connected to the computer via Ethernet without problems, you can add network complexity until the problem occurs.
2. Zero-window: What is the CPU usage for your streaming process? I'm able to essentially cause zero-window events by enabling the debug log in LJM. It also seems like macOS is worse at avoiding zero-window events.
I was wrong in post #8. The stream autorecovery end flag was not in the packet I thought it would be in, so I missed it. Using our test system with the latest stable LabJack firmware and software, I am seeing the zero-window issue. Once the window becomes greater than 0, sometimes we recover without any loss of data. Other times we will get subsequent packets with the status bytes "0B7C". Then a single packet with "0B7D". That packet will then have additional status information of something like 0x5287. I assume this is the # of samples that were lost. This packet is then followed with a packet with the first sample set to 0xFFFF. With this latest LabJack software, our process does not log the "STREAM_AUTO_RECOVER_END_OVERFLOW" messages we were logging with the older LabJack software.
In our field units that are working correctly (same hardware, with your 2015 firmware and software), the CPU utilization of our process is approximately 7%, and our network utilization for this process is 1.7 Mbps, almost exactly the 100,000 samples per second from the T7 (100,000 16-bit samples per second is 1.6 Mbps of payload, before protocol overhead). For these systems we do not log any errors. Total CPU utilization is around 16%, memory around 35%, disk around 0%, and network utilization around 1%.
For the test system running our same code with the 2018 firmware and software, we see our CPU utilization at 50%, which we believe is due to the increased logging in our system: it logs an entry whenever any data channel changes state, which happens at every transition from normal data to 0xFFFF and back again.
In our field system that is exhibiting the improper behavior we have dropped the sample rate to 50,000 samples per second, and the number of samples per read to 50,000. This has reduced, but not eliminated the problem with this system. Here our CPU utilization is around 2.8% for our process and 0.8 Mbps for process network traffic. Just as you'd expect. Our overall CPU and Network utilization is also half the other systems due to these changes.
For our systems, these numbers represent an "idle state" in which we are just looking at these samples to determine if a trigger event has occurred.
I'll need to go back to some of the older Wireshark captures on the misbehaving field system to see if there are other scenarios that differ from this one.
I'll also need to look at how we handle the returned error code to see why we are not logging the "STREAM_AUTO_RECOVER_END_OVERFLOW" error when using your new software.
On your field system, are you getting zero-window events? It sounds like this may not be the case since your other systems are fine and since the CPU usage is low.
Are you getting retransmissions? This is harder to guess because network congestion happens in bursts. When I run Wireshark for a local stream at 100kHz with no retransmissions (using the tcp.analysis.retransmission filter), I see that most of the packets are from the T7 (e.g. greater than 95% of packets). This indicates that overall the network is not extremely busy.
Also, what are the backlogs doing? See my post #10.
The field system and office test systems are indeed behaving differently. You are correct, the field system does not exhibit the zero-window events. The ACK returns a constant value for Window, indicating that we are reading the buffer faster than we are filling it. The Wireshark captures show the field issue as a potential miss of an ACK by the LabJack board. The board transmits 2 packets, the PC ACKs them, but the board fails to transmit the next packet on time and does nothing until about 1 second later, when it retransmits the last 2 packets. This is the failure. During this time, we see no significant increase in network traffic. Like you, the vast majority of our network traffic comes from the T7. That was why I asked about changing the TCP timeout value in post #5; I understand from your response that this is not possible. With respect to the backlog bytes, I typically see values in the 0x26 range, and that doesn't change prior to the transmission cessation.
Is there any other reason aside from not getting the ACK that would cause the T7 to stop sending data and then start with the retransmission of the last 2 bytes about a second later?
On Friday we changed the sample rate of the field system from 100,000 samples per second to 50,000 samples per second. We also changed the samples per read from 100,000 to 50,000. The hope was that by reducing the amount of traffic and the time between packets we could eliminate the problem. This did not resolve the issue.
We also sent a new replacement network switch to the customer in case the switch was damaged back in the early fall. It was installed in the field on Tuesday afternoon and we had no issues Tuesday overnight and early on Wednesday. However, before we could do the happy dance this morning, we found 5 instances of the "all 1" data pattern from yesterday and overnight.
Our plan right now is to test a 2015-vintage T7 and then send it to the field to replace the 2018 unit we bought in the fall. Our system is running the 2015 software on the PC and we want to rule out any issues between those versions that could be causing this.
We are unable to upgrade the field system to your latest code, since our office test system running it is unable to keep up with the data being streamed by the T7. We would need to resolve that issue, which is currently secondary to getting the field system back to the level of performance we had prior to this issue manifesting itself in mid-December.
I assume you mean it resends the last 2 packets, as you mention above. (With the old firmware.) That does sound like a lost ACK, or a theoretical firmware bug. There have been numerous stream-related firmware fixes, so it's important to upgrade. Further, we cannot fix any bugs for you unless you're using recent firmware and software.
It's worth noting that the T7 has a 64-byte buffer on the 702 stream port. If that buffer becomes full due to e.g. a simultaneous broadcast of multiple packets, the T7 would not read the ACK. (All network devices work in this way, of course, but the 64-byte receive buffer on the T7 stream port is likely to be one of the smaller receive buffers on your network.)
You mention you replaced the network switch. Is that the only piece of network hardware between the host computer and the T7? Have you replaced the Ethernet cables?
In your test setup, can you replicate the situation where the T7 retransmits after 1 second?
Why is your test system unable to keep up with data using the new firmware and software? What is happening with the device backlog and the LJM backlog?
LJM 1.2000 adds the function LJM_GetStreamTCPReceiveBufferStatus, which should help diagnose similar issues without Wireshark. If you have any further suspicions about a zero-window problem, you can use LJM_GetStreamTCPReceiveBufferStatus to programmatically detect that the window size is decreasing. You can also set the window size using LJM_STREAM_TCP_RECEIVE_BUFFER_SIZE.
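For example, a sketch of polling it while the stream runs (requires LJM 1.2000+; a backlog approaching the buffer size is the precursor to a zero-window event):

```c
#include <stdio.h>
#include <LabJackM.h>

/* Call periodically during streaming to watch the TCP receive buffer. */
void check_tcp_receive_buffer(int handle)
{
    unsigned int size = 0, backlog = 0;
    int err = LJM_GetStreamTCPReceiveBufferStatus(handle, &size, &backlog);
    if (err == LJME_NOERROR)
        printf("TCP receive buffer: %u of %u bytes used\n", backlog, size);
}

/* Before starting the stream, the buffer can also be enlarged, e.g.:
   LJM_WriteLibraryConfigS("LJM_STREAM_TCP_RECEIVE_BUFFER_SIZE", 65536); */
```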
LJM 1.2000 is currently available as a beta installer:
https://labjack.com/support/software/installers/ljm/
Edit: fixed link.