FRC Network table issue
Pete from United States  [56 posts] 11 year
Hi Steven,

Our team is trying to use Network Tables. For the most part we have been able to send data to the table, but RoboRealm keeps reporting an error. When sending data to the table, a popup window appears saying "Unknown message ID network table X", where X is a changing number. You select 'OK' to dismiss it, but numerous occurrences of the same window follow, each with a different X value. Once all are cleared it may run fine for 10-20 seconds, and then the messages begin appearing again.

My suspicion is that the network table is responding back to RoboRealm and RoboRealm does not understand what it is receiving. Do you have any idea what is going on? Is there some log file to help determine what is causing this message?

Thanks, Team 706
-pete
Steven Gentner from United States  [1446 posts] 11 year
Pete,

Yes, those are messages coming from the CRio that RR does not understand. There are a couple of reasons why this may be the case.

1. Update RR. Not sure what version you are running but a quick upgrade is always recommended. Use the same link as you used to originally download RR. It auto-updates to the current version.

2. What version of network tables are you running on the Crio? Be sure to have updated any changes from Netbeans (assuming you are using Java) or whatever platform you are using as there are changes to the NT from the CRio side too. Most people running an older Network Tables from last year would see these messages.

3. Ensure that you are running .47 firmware on the CRio.

4. Finally, try running the TableViewer (typically located in C:\Users\Your username\sunspotfrcsdk\tools) and do NOT enter any hostname. This will cause TableViewer to act like a Network Table server. Point RR at this IP address and see if you have any problems. If you have the right version all should be ok. That will ensure that RR and the NT on your PC are ok; then we can resolve any networking issues that may occur.

STeven.
Pete from United States  [56 posts] 11 year
Hi Steven,
Thanks for the quick response. As you are aware, building and programming a robot in 6 weeks is not a trivial task, and working with high school students puts an interesting twist on things.

Some additional information: on Tuesday they were able to read back "BFR_COORDINATES" but noticed the cRio displaying messages about a client connection error. I believe communication with the network table shut down and then restarted, though not every time. This happened multiple times while RR was running.

On Wednesday evening I worked on tracking down the issue. Stopping the RR script caused the errors to stop, which confirmed that the RR Network Tables module seemed to be causing them. I thought sending the 8 values in BFR_COORDINATES might have something to do with the errors, so I modified the VBScript to create a new variable, V_POS, and sent that to the table in place of BFR_COORDINATES, in effect sending just a single variable. When run, the errors stopped and I thought I had discovered the problem.

Yesterday at our meeting I told them I thought I had discovered the error, only to find out it was still there. I was using the same computer, cRio and camera as the day before. Once I thought I had things working I never reset everything to confirm it all still worked; it was getting late.

I was using RR ver 23 and updated on Thursday, which brought it to ver 25, and still had the same issues. We are programming in C++ and have the latest C++ update from 1/18/2013. The cRio is imaged with V47.

I will try your step 4 as listed below.

Are there any debugging tools / log files to help see what is going on?

Could it be sending data as text vs double to the table?


Thanks,
-pete
Steven Gentner from United States  [1446 posts] 11 year
Pete,

Thanks for the information. Could you post the NetworkTables-related C++ code so that we can try to replicate the issue?

Yes, it is possible that doubles are being sent as text. We'll look into that potential issue.

It's also possible that the RR NT module is sending something that the CRio NT system doesn't like and it is shutting down the connection when that happens. We'll check into that too.

STeven.
Steven Gentner from United States  [1446 posts] 11 year
Pete,

After some testing on how things get started there is another problem that might affect your situation. The problem is one of type.

When you first start up RR there may or may not be a target in view. If not, the BFR_COORDINATES array does NOT get created. Thus its type is essentially unknown. When the NT module sees this, it defaults to an empty String/Char type. When a target finally gets into view, BFR_COORDINATES is now defined as an array ... but we're too late! The NT spec does NOT provide the ability to change the type from String to Double Array, so we are stuck with a null/empty value until the NT system is reset.

So now (after the current fix), instead of defaulting to an empty string, if the NT module sees a variable that is not set it just ignores it. When it does get set it will lock in that type and only after that send null/empty values since the type is now known.
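
The behavior described above can be sketched roughly as follows. This is only an illustration in Python, not the actual RR module code; the class name TypeLockingPublisher and the method prepare are hypothetical.

```python
from typing import Any, Dict, Optional


class TypeLockingPublisher:
    """Hypothetical sketch of the 'ignore until set, then lock the type' rule."""

    def __init__(self) -> None:
        # Maps a variable name to the type locked in on its first real value.
        self._locked: Dict[str, type] = {}

    def prepare(self, name: str, value: Optional[Any]) -> Optional[Any]:
        """Return the value to send, or None when nothing should be sent."""
        if name not in self._locked:
            if value is None:
                # Type still unknown: skip instead of defaulting to a string.
                return None
            self._locked[name] = type(value)
            return value
        locked = self._locked[name]
        if value is None:
            # Type is known, so an empty value of that type is safe to send.
            return locked()
        if not isinstance(value, locked):
            raise TypeError(f"{name} is locked to {locked.__name__}")
        return value


pub = TypeLockingPublisher()
print(pub.prepare("BFR_COORDINATES", None))        # no target yet: skipped
print(pub.prepare("BFR_COORDINATES", [1.0, 2.0]))  # first value locks the type
print(pub.prepare("BFR_COORDINATES", None))        # empty array is sent
```

The point of the sketch is only that an unset variable never fixes a type, so the first real value decides it.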

This seems to clear up the startup issue, which may have been an issue for you and would explain why things worked at one point but, after a reset, didn't.

v2.48.26 has this update.

STeven.
Pete from United States  [56 posts] 11 year

Hi Steven,

I have downloaded latest rev 2.48.26 and still have the problem.

Here is the sequence of events.  RR is closed. Table Viewer is closed.  SmartDashboard is closed.  Driver Station is closed.  cRio is rebooted.  This ensures that the NT has been cleared.

I have changed RR to only write IMAGES to the NT and then closed RR.

1.) Driver Station is started, which also brings up the SmartDashboard.
2.) Table Viewer is started, showing disconnected.
3.) Program is downloaded to the cRio, after which the NT table is connected and gets populated.
4.) RR is started, after which the following is displayed in the NetConsole window.

-> IOException message: End of File
0x10c7058 entered connection state: CLIENT_DISCONNECTED
Close: 0x10c7058
Task exited normally: Server Connection Reader Thread
task 0x10ec718 (FRC_Server Connection Reader Thread) deleted: errno=9 (0x9) status=0 (0)
0x10c7058 entered connection state: GOT_CONNECTION_FROM_CLIENT
Starting task: Server Connection Reader Thread
write error: : S_errno_EPIPE
IOException message: Could not write all bytes to fd stream
0x10c7058 entered connection state: SERVER_ERROR
Close: 0x10c7058
Task exited normally: Server Connection Reader Thread
task 0x10ec8b0 (FRC_Server Connection Reader Thread) deleted: errno=9 (0x9) status=0 (0)
0x10e6ed0 entered connection state: GOT_CONNECTION_FROM_CLIENT
Starting task: Server Connection Reader Thread
0x10e6ed0 entered connection state: CONNECTED_TO_CLIENT
IOException message: End of File
0x10e6ed0 entered connection state: CLIENT_DISCONNECTED
Close: 0x10e6ed0
Task exited normally: Server Connection Reader Thread
task 0x10ec9c8 (FRC_Server Connection Reader Thread) deleted: errno=9 (0x9) status=0 (0)
0x10e6ed0 entered connection state: GOT_CONNECTION_FROM_CLIENT
Starting task: Server Connection Reader Thread
0x10e6ed0 entered connection state: CONNECTED_TO_CLIENT

Several more of these messages appear and then a window pops up with the title "Error" and the message "Unknown message ID from network tables:X".
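
One way to make a long NetConsole capture easier to read is to tally the connection states it reports. The following is a minimal Python sketch that assumes the log has been saved as text and that each state change uses the "entered connection state: NAME" wording seen above; the function name tally_states is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "0x10c7058 entered connection state: SERVER_ERROR".
STATE_RE = re.compile(r"entered connection state: (\w+)")


def tally_states(log_text: str) -> Counter:
    """Count how often each connection state appears in a NetConsole capture."""
    return Counter(STATE_RE.findall(log_text))


sample = """\
IOException message: End of File
0x10c7058 entered connection state: CLIENT_DISCONNECTED
0x10c7058 entered connection state: GOT_CONNECTION_FROM_CLIENT
write error: : S_errno_EPIPE
0x10c7058 entered connection state: SERVER_ERROR
0x10e6ed0 entered connection state: GOT_CONNECTION_FROM_CLIENT
0x10e6ed0 entered connection state: CONNECTED_TO_CLIENT
"""

for state, count in tally_states(sample).most_common():
    print(state, count)
```

Comparing the counts from a good run and a bad run may show whether disconnects or server errors dominate.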

Attached is RR script file.

Hope this helps to diagnose the problem.

Thanks,
-pete
program.robo
Pete from United States  [56 posts] 11 year
Hi Steve,

Here is some additional information.

When using the Table Viewer with the laptop as the server, the Unknown message ID popup errors do NOT appear.  I cannot tell whether any of the other errors seen in NetConsole when using the cRio as the NT server still occur.

We have created a new project using the command-based code example.  No changes were made to the example project.  The project was compiled and sent to the cRio.  When it started, the NT connected and populated.  I then started RR with the script sent earlier, which just sends the IMAGE COUNT to the table, and the errors occurred.

This seems to point to an issue with NT running on the cRio.

Based on this you should be able to duplicate.
RR at ver 26
cRio image V47
Latest C++ update from 1/18/2013

Is there anything else I can try to resolve this issue?  I am wondering about going back to the previous C++ update.


Thanks,
-pete
Steven Gentner from United States  [1446 posts] 11 year
Pete,

Thanks for the information. It does seem to point to something within the C++ version of the network tables on the CRio. Unfortunately, we've hit a bit of a snag in that we don't have the WindRiver C++ compiler for the CRio. So instead we will try to get a C++ version running on the desktop and hope that this also exhibits the same errors.

We did verify that the Java version still seems to work without errors on the CRio. Most of the teams we have worked with are using either Java or LabView on the CRio, so it's possible that this hasn't come up before due to that.

It is possible that even the TableViewer would exhibit these errors if you could make many updates per second which is what the RR module does.

One quick test: can you add in a Timer module, set it to only continue every 1 second, and add that just before the Network Tables module? This will force the NT module to only send out updates once a second. This will test to see if perhaps the NT system on the CRio is just getting too many requests and losing track. It's not a solution but just another test to help narrow down the issue.

I'm not sure if reverting back to a previous version would help. The network tables 2.0 are new this season. The C++ version was one of the last platforms after Java and LabView that were developed.

We may also add some additional logging to the NT module later today ... more on that later.

STeven.
Pete from United States  [56 posts] 11 year
Hi Steven,

Thanks for all the help.  Looks like we have discovered the issue.  I was trying to slow things down by changing the camera refresh rate, which was not the correct way; using the Timer module as you suggested helped lend insight into the problem.  It seems that the cRio processor jumped to 100% when RR started and could not keep up with the NT updates from RR.  Adding a timer delay to the pipeline seems to correct the errors we were having.

Can you confirm the following: will adding the Timer module affect the FPS of the image being displayed?  I would assume it does.  If so, is it possible to keep the FPS high and only slow the NT updates?  (This was asked by one of the students.)

I have the program running on this system as I write this response.  The cRio processor is running at about 90%, up from 50% when RR was not running, and I just received a couple of Unknown message ID error popups, so it seems some additional delay is needed.  The delay is currently 25 ms with the FPS at 15.

Thanks for all the help, wondering if we should switch to using sockets to pass data instead of NT.

-pete

Pete from United States  [56 posts] 11 year
Hi Steven,

Here's an update.  In an attempt to determine the minimum delay to set in the Timer module, I got some interesting results.  It seemed to be working with no errors with the delay at 25 ms, and it seemed to keep working while lowering the delay by 5 ms.  While dropping the delay I was watching the cRio CPU usage.  One thing to note: when TV and/or SmartDashboard were running there was a much larger CPU load.  Both TV and SmartDashboard were quit, and at this point the CPU was at about 50% with RR not running.

I seemed to be able to get all the way down to 1 ms before I started seeing errors.  The CPU load was only about 80% and the errors were still coming.  At this point I started increasing the delay but was not able to stop the errors.  I quit RR, reset the cRio, and confirmed the NT were cleared by starting TV and then quitting it.  After that I could not stop errors from occurring, both in NetConsole and as popups, even after increasing the delay to 50 ms.

At this point the computer was rebooted.  Driver Station was started, and SmartDashboard was shut down when it started.  I then started Wind River and loaded the cRio, which had also been rebooted.  The CPU was at about 50%.  I started RR, which had the delay set to 50 ms; the CPU jumped to about 75% and it ran with no errors for about 5 minutes, and then the errors started occurring in both NC and popups.  I was unable to get the errors to stop.  I increased the delay to 75 ms and took a hit in the FPS, which dropped to 7.  This ran for several minutes and the errors started occurring again.  I increased the delay to 100 ms and was still getting errors.

Therefore I cannot conclude, as mentioned in my previous e-mail, that cRio loading was causing the errors.  If I shut the RR script down for several minutes and run it again with a 100 ms delay, it will run for about 30-40 seconds and then begin getting errors, but this is intermittent.  When the RR script is started I see only a 10-15% increase in cRio CPU usage, to about 70%.

Not sure what is going on without some additional tools and logging, as the results are inconsistent.  It will seem to run well for a while at a 75 ms delay but will get intermittent errors.  I wonder if this is something the folks at WPI need to take a look at.

Thanks,
-pete
Steven Gentner from United States  [1446 posts] 11 year
Pete,

Thanks for the detailed analysis. That's very helpful to have context to the problem.

I seem to remember reading that the NT transmissions are grouped in 100ms periods in order to reduce network traffic ... no mention of load on the CRio. So what we will do is add a delay field within the NT module which will only send out updates every X ms. This will NOT delay the fps but will delay the rate at which the CRio gets updates. They claimed that 10 updates per second is sufficient ... which seems to be the case depending on what you are doing.
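
The idea of limiting the send rate without slowing the image pipeline can be sketched as below. This is a rough Python illustration under assumptions, not the NT module's implementation; ThrottledSender, offer and the injectable clock parameter are hypothetical names.

```python
import time
from typing import Any, Callable, Optional


class ThrottledSender:
    """Hypothetical sketch: forward at most one update per min_interval_ms."""

    def __init__(self, send: Callable[[Any], None], min_interval_ms: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send
        self._interval = min_interval_ms / 1000.0
        self._clock = clock
        self._last: Optional[float] = None  # time of the last send, in seconds

    def offer(self, value: Any) -> bool:
        """Call once per frame; returns True only when the value was sent."""
        now = self._clock()
        if self._last is not None and now - self._last < self._interval:
            return False  # too soon: drop this update, the frame loop continues
        self._send(value)
        self._last = now
        return True


sent = []
sender = ThrottledSender(sent.append, min_interval_ms=100)
for frame in range(3):
    sender.offer(frame)  # back-to-back frames: only the first is forwarded
print(sent)
```

Because offer returns immediately either way, the frame loop keeps its own pace and only the outgoing updates are thinned.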

What's strange is that we have not heard of this timing issue with the Java version .. I'd expect the issue to be there rather than on the C++ version which should be running at top speed.

How are you monitoring the CPU utilization of the CRio? We'd like to perform those same tests with the Java version to see if that is also an issue that we've just not heard about.

Thanks,
STeven.
Pete from United States  [56 posts] 11 year
Steven,

The cRio CPU usage can be seen in the latest driver station under the Charts tab.  

I have been trying to dig into the packet transmission between the computer which has the Driver Station running and RR using Wireshark.  Needless to say, this is like looking for a needle in a haystack.  The idea was to see if I could locate the difference between good and bad transmissions.

Wondering if you could check with FIRST and get a copy of Wind River.  Do you know of any FIRST teams in the area?  You could possibly get a copy from them to help diagnose this problem.

We are just using the command-based example with no additional code changes.  It just starts the NT on the cRio and then RR sends data to it.

Just had a thought: I could send you the .out file that gets loaded on the cRio.  It looks to be a 6.7M file.  When the cRio boots it tries loading a file FRC_UserProgram.out on startup.  This would start up the NT and you could test sending data to the cRio NT.

I like the idea of controlling how fast RR updates the NT.  Running at 30 FPS, new data would only be available every 33.3 ms, and this does not account for any RR processing.  Even if one could get 60 FPS, that is only 16.7 ms.
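
The frame-interval arithmetic above can be checked with a short Python sketch. The 100 ms window is the NT grouping period mentioned earlier in the thread; the function names are hypothetical.

```python
def frame_interval_ms(fps: float) -> float:
    """Time between frames in milliseconds for a given frame rate."""
    return 1000.0 / fps


def frames_per_window(fps: float, window_ms: float) -> float:
    """How many frames (potential NT updates) fall inside one time window."""
    return fps * window_ms / 1000.0


print(round(frame_interval_ms(30), 1))  # 33.3
print(round(frame_interval_ms(60), 1))  # 16.7
print(frames_per_window(30, 100))       # 3.0
```

So at 30 FPS roughly three frames of data would land inside each 100 ms period if every frame were sent.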

So right now with no timer delay how fast would RR be sending data to the NT?  

-pete
Steven Gentner from United States  [1446 posts] 11 year
Pete,

Yes, if you can zip that up and try posting it here, or use the contact form and send us your email, we can try that out.

Interesting experiment: we reconfigured the CRio for C++ without any robot code and the CPU utilization is already at 25%. It was flat at 0% for the same Java configuration! Something strange is going on here ...

The NT module now has a delay timer on it that you can use to slow down sending of data but not change the FPS. Currently the send speed would be that of the fps ... 33ms for 30fps. So most of your tests below 35ms probably did not make any difference.

STeven.
Pete from United States  [56 posts] 11 year
Hi Steven,

I tried the new version on a new laptop at our meeting today.  All worked well with no errors using a NT delay of 40 ms.  I would still like to try it on the laptop where the errors were occurring.  After all the testing with the initial laptop I just feel there is something on that system that seems to be affecting the communications.

I was trying to analyze the cRio CPU usage using the Driver Station log file, as the chart was reading off the scale.  I was not able to get log file data to check CPU loading at this time; this is another issue.

Thanks for all the support,
-pete
Steven Gentner from United States  [1446 posts] 11 year
Pete,

(Addendum to my email) I wonder if the Windows firewall or Defender are interfering with the communication ... or perhaps a loose connection? Try disabling the firewall and test. If you are using wired ethernet try wiggling that about and see if the error increases ... or literally shake the laptop to see if something is loose (not too hard though!).

Last year we had to disable the firewall to even communicate with netbeans to the CRio ... but this year that's not needed (different laptop).

STeven.

Steven Gentner from United States  [1446 posts] 11 year
Pete,

There is another case of this same issue on the CD forums also using C++ so the problem is definitely not isolated to your setup.

When you have a chance, can you download the most recent version and see what additional messages you see in the message box? We moved away from the popup since that can cause the module to wait for user input in the middle of a contest, and added some additional logging to provide some more context to the errors.

Thanks,
STeven.


Anonymous 11 year
Hi Steven,

Will give it a try tomorrow.  I have been having other network issues with the initial computer and have not seen any issues with the new system, but I have to confess don't have much run time with the new system.  Tomorrow it should get a lot of run time.

The initial system is my work laptop and there are a ton of security programs running on it.  They have set it up so that I cannot disable the firewall.  We used to use Sophos but have switched to McAfee.

The operation was just strange: there was no set pattern, it would just throw random errors, seen either in NetConsole or as popups.

-pete
Steven Gentner from United States  [1446 posts] 11 year
Pete,

Another data item: the CRio C++ WPILib was updated to fix an array bug in the Network Tables area. So it's also worth ensuring that you get all the current updates for the CRio/WindRiver too. This would have directly affected the sending of the BFR_COORDINATES or any other array created by RR.

STeven.
Pete from United States  [56 posts] 11 year
Hi Steven,

I have updated to the latest version of RR 2.29.8 and the latest C++ update, 2nd mid season update 3615.  I re-built the C++ project on the laptop that I was having issues on.  I don't see any popup errors, but this system still has problems.  In NC I get numerous messages: read errors, write errors, and the server thread task exiting and then starting up again.

The RR messages in the NT module say 'Read Failure! Disconnected from server!' along with 'Unknown message ID from Network tables'; the majority seem to be Read Failures.  These appear when NC indicates server errors.

I will try on my other system, as this one seems to have a ton of security stuff that I'm not able to turn off (work computer).  Not sure if this is a hardware or software issue, as I had all the boards replaced several weeks ago and I also seem to have some network connection problems at work, both wireless and wired.  Nothing I can put my finger on, but I see intermittent slow response sometimes.

-pete

Pete from United States  [56 posts] 11 year
Hi Steven,

Here's an update.  Using a different laptop, updated with RR 29.8, the latest C++ update and the latest Driver Station, things are working much better.  The only message in the NT message box is 'Initial Sync completed' and there are no scrolling messages in NC.

The driver station charts are working with the latest DS update.  CPU at about 10% and jumps 15% to 25% when RR is started and NT are updating.

One thing to note: if you make a change to the Send Delay in NT while running, NC will display some error messages.  Some look a little like gibberish, but it seems to recover.  The NT message box will display 'Hostname or port changed' and another 'Initial Sync completed' message.  Also, if RR is stopped and a change is made to the Send Delay, this is displayed in NC:

-> IOExceptio0nx 1m0eds4sca0g0e :e nEtnedr eodf  cFoinlnee
c0txi1o0nd 4sft9a8t ee:n tGeOrTe_dC OcNoNnEnCeTcItOiNo_nF RsOtMa_tCeL:I ECNLTI
ESNtTa_rDtIiSnCgO NtNaEsCkT:E DS
eCrlvoesre :C o0nxn1e0cdt4ifo9n8
RTeaasdke re xTihtreeda dn
ormally: Server Connection Reader 0Txh1r0eda4dc
0task 00x 10db840e (nFRC_Server Connection Reader Threadt) deleted: errno=e9r (e0xd9 ) status=c0o (n0n)
ection state: CONNECTED_TO_CLIENT

For the most part things seem to be running.

-pete

