Best method to access shared memory remotely
I am wondering if anyone has constructed an application like ours on the Power PMAC successfully.

We have a fairly complex system with many axes that are being controlled from within the Power PMAC environment. However, we also have a large number of EtherCAT IO points handled by the Delta Tau. Most of these are used for more generic machine control tasks such as valves and other process control items.
To handle these, we wrote a separate C++ task in Xenomai that manages the IO points by interfacing directly with the PMAC shared memory (pshm->Ecat[0].Io[n], etc.). That seems to work very well, and besides the IO interface, this C++ code acts as the "master" in all higher-level motion control-related tasks.

The difficulty seems to come into play when this code, which directly talks to our UI over TCP/IP, rapidly sends data back & forth. This causes Mode Switches in Xenomai, which, in turn, seem to lead to instability of the entire PMAC. We have had scenarios where the PMAC becomes unresponsive after a half hour or so and needs to be power cycled. Not acceptable in our system.

From what I read about these Mode Switches (MSW; cat /proc/xenomai/stat), they're hard to avoid if you are doing realtime stuff (reading & writing to the pshm-> structure) and also doing TCP/IP or file tasks that involve the generic Linux OS.

At this point, I'm looking for suggestions on solving the source of this issue. I can move my C++ code to another system (i.e. not running on the Xenomai kernel on the PPMAC), but then I need fairly rapid (1 ms) access to the entire pshm-> structure somehow from outside the PMAC.

Or.. I need another way to shuttle data from one side of the fence -- Xenomai -- to Linux (and back) so it can be sent over TCP/IP without incurring a mode switch.

Any and all suggestions are welcome!
rvanderbijl: Great question. This is something my company bumped up against many years ago when we started using Power PMAC. The short answer is: yes, it is definitely possible to avoid mode switches with this type of architecture.

When you say "shuttle data from one side of the fence -- Xenomai -- to Linux (and back)" this is exactly what shared memory is for. Accessing shared memory does not incur a mode switch (it will incur context switches, but not mode switches, labeled MSW in /proc/xenomai/stat). But you need to ensure that your process running within Xenomai's primary mode is not the thread that is doing the TCP/IP communications. This is because each call to the OS network stack switches the current thread from primary mode to secondary mode.

Therefore you need to separate your real-time code from your TCP/IP code. You can communicate between these two tasks using 1) DT shared memory, 2) your own allocated shared memory, or 3) a message queue. Here is a function I use to switch my realtime threads to primary mode:

For Xenomai native task library:
#include <native/task.h>

int task_switch_to_primary( void ) {
    // returns <0 if the thread did not switch to primary mode
    // (older Xenomai versions use T_PRIMARY instead of T_CONFORMING)
    return rt_task_set_mode(0, T_CONFORMING, NULL);
}

For Xenomai posix task library:
#include <pthread.h>

int task_switch_to_primary( void ) {
    // returns nonzero if the thread did not switch to primary mode
    return pthread_set_mode_np(0, PTHREAD_PRIMARY);
}

Also make sure that you are calling mlockall() while initializing your realtime task:

err = mlockall(MCL_CURRENT | MCL_FUTURE);

One of our original concerns was that making our TCP/IP loop a background (non-realtime) thread would incur serious performance penalties. So far this hasn't been a problem: using a polling architecture with select(), we are able to get 250+ microsecond updates between Ethernet packets (depending on payload size), albeit with some jitter depending on CPU load. 1 ms updates should be no problem at all as long as the target on the other end can handle it (in testing a while back we found that Windows-to-Power-PMAC could do ~1-2 ms updates with some jitter depending on the Windows machine's CPU usage, while Power-PMAC-to-another-realtime-CPU could do ~250 us updates reliably).

I would also be more concerned about the unresponsive Power PMAC. I have never had mode switches cause a Power PMAC to become unresponsive (even with code that switches mode many times each second), but I have seen this problem when I didn't allocate memory properly or when I would try to do "funny" things in kernel mode (typically as part of user servo or user phase code). You may already know this, but if you plug a serial cable into the Power PMAC it will give you some dump information when the Power PMAC crashes and this can help you troubleshoot where the problem is occurring.
Thanks for your detailed response!
I did not know about the dump data when the Delta Tau crashes. Unfortunately I don't have a PC nearby, but I will set one up to capture any crashes (of course, when I was there with my laptop for 4+ hours, it never crashed.. :) ). I have also witnessed at least 2 crashes while my C++ code was NOT running. Just the Delta Tau PLC and Motion code. There is a small chunk of C code in there, but I can't imagine it is responsible for a crash (custom user phase routine, nothing exciting in there).
Also, it appears when this happens, the entire system becomes non-responsive (no response on terminal sessions or pings), but after 2-3 minutes, it's rebooted itself. Looks like the watchdog must have kicked in at that point...

Curious -- Can you have a thread inside a real-time task run in primary mode? Or does it need to be a separate task? I currently have a thread dedicated to communication over TCP/IP that already uses my own chunk of shared memory to talk to the other threads responsible for controlling the system.

Also -- Have you had any experience with the dual core PPMAC? I tried to change the CPU affinity to run the task on the other (much less busy) CPU, but then I incur mode switches for every access into pshm-> memory... That caused a consistent crash after 30 mins or so.
Update --

I implemented a new task with the code that shansen suggested. I'm still seeing a ton of mode switches, but I think I have an idea where they're coming from: as part of accessing the shared memory, I use events and mutexes between the realtime task and the TCP/IP task to synchronize the two. I suspect the rt_mutex_acquire calls are forcing the thread back into primary (real-time) mode, and the subsequent socket write then forces it back out.

shansen, how did you solve the mutex issue in your application (assuming you use them)? I googled it a bit, but haven't found any hints yet..
rvanderbijl: Each thread runs separately and is scheduled either by the real-time kernel (Xenomai) or by Linux (background, non-RT). Scheduling is assigned on a per-thread basis, so a single thread cannot be both realtime and non-realtime.

I have some limited experience with the dual core CPU, but probably not enough to help with your cpu affinity issue. I know that DT has configured the OS to run in SMP mode so Xenomai schedules threads to run on both CPUs. The end result is that Xenomai should be automatically selecting the best core for each thread and you shouldn't have to manually assign affinities.

rt_mutex_acquire is part of the native Xenomai API and definitely should not result in a mode switch. The best way to debug is to use rt_task_inquire (or the posix equivalent if you are using the posix task API) to figure out where your program is switching back to secondary mode. Note that if your task is running in secondary mode (non-realtime), then a call to rt_mutex_acquire will switch it back to primary mode, causing a mode switch. So it is most likely that something else in the code is causing a switch to secondary mode and then it switches back to primary when rt_mutex_acquire is called.

EDIT: oops, didn't read your last reply carefully enough. Yes, the socket writes are definitely switching your thread to secondary and then the rt_mutex_acquire is switching you back to primary mode. The solution is to move all socket writes to a background thread. We typically allocate our own shared memory between socket threads and RT threads and use some simple request/acknowledge flags to synchronize data transfer via the shared memory.

Thanks for your reply. I was trying to change the affinity so as not to step too much on the DT native servo-control processes, which on my system run on core 0. Core 1 is mostly idle, and since our task uses about 20-30% CPU, I figured having it run on the other core should be helpful. But that causes a mode switch every time I touch pshm->....

In any case, I'll try to change from events & mutexes across shared memory to a few status bits and see if that makes a difference.

That said, I also ran into something else weird, which I posted on the xenomai mailing list. When I try to access malloc'ed memory (or rt_heap_create/alloc) in a RT thread, I see it switch to secondary mode when I do a memset or memcpy on a block larger than 32 bytes (trying to do it on 320k of memory). If I do a loop of 10k iterations of 32 bytes, it works just fine without mode switches. Not cool... I did get a reply, and apparently this can happen when the hardware TLB misses a memory access due to moving too much memory around.
I have a layer of shared memory that I exchange with our user interface. To limit TCP/IP traffic, I only send changes which means I need to move around 300-400k around to do comparisons and set up blocks to be transferred.

Perhaps with these limitations we need to re-think our architecture....
Interesting, I haven't seen that problem before. But I also typically avoid using memcpy within realtime threads due to performance issues (large copies can cause excessive jitter from cache misses).

OK, I just set up a quick test and added a memcpy of 10k bytes to and from DT shared memory while in kernel space and it doesn't seem to result in mode switches. Is it possible that you are allocating your shared memory in user space? I believe when you are sharing memory between kernel and user space you need to allocate in kernel space first using kmalloc. Here is one of my routines that allocates shm for this type of application:

Kernel Space code:

#define MODULE_NAME "some_module" // will be created as /dev/some_module

static dev_t version;       // device number from alloc_chrdev_region
static struct cdev *driver;
static void *vdb_memory;    // kmalloc'd block shared with user space

static int vdb_mmap( struct file *filp, struct vm_area_struct *vma );

static struct file_operations fops = { .owner = THIS_MODULE, .mmap = vdb_mmap, .unlocked_ioctl = NULL };

static int
vdb_mmap( struct file *filp, struct vm_area_struct *vma ) {
    int err = 0;

    vma->vm_ops = &mmap_ops;
    vma->vm_flags |= VM_RESERVED;
    vma->vm_flags |= VM_SHARED;
    vma->vm_flags |= VM_LOCKED;
    vma->vm_private_data = filp->private_data;

    err = remap_pfn_range(vma, vma->vm_start, virt_to_phys(vdb_memory) >> PAGE_SHIFT,
                          VDB_SHARED_MEMORY_SIZE, vma->vm_page_prot);
    if( err < 0 )
        return -EAGAIN;

    return 0;
}

static int
init_and_allocate_memory( void ) {
    if( alloc_chrdev_region(&version, 0, 1, MODULE_NAME) < 0 ) {
        printk(KERN_ERR "%s ERROR: Invalid device version.\n", MODULE_NAME);
        return -EAGAIN;
    }

    driver = cdev_alloc();
    driver->owner = THIS_MODULE;
    driver->ops = &fops;

    if( cdev_add(driver, version, 1) < 0 ) {
        printk(KERN_ERR "%s ERROR: Could not create a device.\n", MODULE_NAME);
        return -EAGAIN;
    }

    if( sizeof(vdb_t) >= VDB_SHARED_MEMORY_SIZE ) {
        printk(KERN_ERR "%s.ko ERROR: VDB size of %u bytes is greater than shared memory size of %lu bytes.\n",
               MODULE_NAME, (unsigned) sizeof(vdb_t), (unsigned long) VDB_SHARED_MEMORY_SIZE);
        return -ENOMEM;
    }

    vdb_memory = kmalloc(VDB_SHARED_MEMORY_SIZE, GFP_KERNEL);
    if( !vdb_memory ) {
        printk(KERN_ERR "%s.ko ERROR: Could not allocate VDB memory.\n", MODULE_NAME);
        return -ENOMEM;
    }

    memset(vdb_memory, 0, VDB_SHARED_MEMORY_SIZE);

    return 0;
}
User Space code:
static int fd;
static void *sharedmem;

int
vdb_open_database( void ) {
    int err = 0;

    fd = open("/dev/some_module", O_RDWR);
    if( fd < 0 ) {
        perror("Could not open /dev/some_module");
        return -1;
    }

    sharedmem = mmap(0, VDB_SHARED_MEMORY_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if( sharedmem == MAP_FAILED || !sharedmem ) {
        perror("Could not mmap vdb memory via APC firmware");
        return -1;
    }

    if( !is_aligned(sharedmem) ) {
        munmap(sharedmem, VDB_SHARED_MEMORY_SIZE);
        printf("libvdb ERROR: mmap'd shared memory was not 4 byte aligned.\n");
        return -1;
    }

    return err;
}

I did some hack-y copying and pasting there so hopefully I got everything you need. Essentially the steps are (assuming you are using a char driver in kernel space for your implementation):

1) Create cdev in kernel space
2) Allocate memory with kmalloc in kernel space
3) Setup kernel module for mmap access
4) In user mode use mmap with /dev/your_char_device to get mmap'd access
Hmm.. I'm guessing that I missed some steps then. I figured if I did a rt_heap_create & alloc from my RT Task that I would get memory I could use between RT and non RT threads, owned by RT..

Is it necessary to make a kernel module? And if so (fairly new to Linux here.. bear with me.. most of my background is on the MS side), do I need to do something special to load/use that?
Ok.. I think I just realized something fundamental here after doing more digging. Apparently calling rt_task_create from my user space code (.out) does not make it a Xenomai RT task. It looks like it does, but I guess it actually runs in non-realtime user space?

Then I guess I have the reverse problem with my mode switches -- I call rt_ functions in my code, which I guess switch it to primary mode and subsequently back to secondary. And on top of that, where I thought I had an RT task, I don't?? Gotta say the performance of my non-RT task is pretty impressive ;)
shansen (and group):
Can someone confirm that when you create an RT task using RtTaskCreate from a .out (user mode, not kernel mode) program, you do not actually have a task running in real-time under Xenomai (as mentioned in my post above)?
I tried changing all the rt_ calls in my code to standard linux calls (pthread_*), but I still see mode switches (and weird crashes, which I'm still trying to figure out).
I would be more responsive but have been buried working on other projects!

Calling rt_task_create does create a thread scheduled using the Xenomai scheduler and therefore this thread will incur mode switches if you use syscalls within that context.

Based on my understanding of what you are trying to achieve, you should have one thread created with rt_task_create that runs your deterministic code, and one thread created using pthread_create that acts as a background thread for handling communications.

Are you using the Delta Tau IDE? I haven't used the IDE in years so bear with me, but one gotcha is that when the IDE generates Makefiles, it automatically wraps pthread_create with the Xenomai equivalent (or at least it used to). This means that if you call pthread_create, the linker will actually link to Xenomai's pthread_create_rt() instead of pthread_create() (making the thread scheduled by Xenomai instead of the Linux scheduler). This could be the source of your problems. Check the Makefile for your communications program and verify that there isn't a "--wrap,pthread_create" command passed to the linker. The realtime Xenomai posix functions are located in libpthread_rt, so your makefile will probably have a "-lpthread_rt" arg passed to the linker as well.

Even with wrapping, this can still be made to work. What you will see is that the first time your background (communications) thread calls a send() function, the thread will switch to secondary mode. But it shouldn't switch back to primary on its own unless you call a Xenomai realtime function, so it should stay in secondary mode for the rest of its lifetime. What makes it tricky is that the IDE might wrap other functions too, so check the makefile to figure out if you are calling any other functions wrapped by Xenomai functions and make sure to avoid them (or remove the wrapping altogether if you are comfortable using a custom makefile). Wraps in the makefile look like this:

WRAP    := -Wl,--wrap,clock_getres           \
           -Wl,--wrap,clock_gettime          \
           -Wl,--wrap,clock_settime          \
           -Wl,--wrap,clock_nanosleep        \
shansen, thanks for your reply. I completely understand being busy, I'm there right now, and under the gun. Hence my slight impatience. My apologies!

That said, I am NOT in the DT-IDE environment for my C++ code (it's C++, I took the DT makefile and hacked it up to work with G++ to build my code). There are plenty of --wrap commands in there. Interesting.. Now I know what these are for!

But most importantly, if I do an rt_task_create, I'm definitely in "primary" mode without having to create a kernel module. That's good. I'll continue down the path of splitting off the comms and RT tasks (as right now the comms thread is created from the RT task using a pthread_create, which is likely wrapped).
As I'm still using mutexes, I assume I shouldn't use the rt_mutex_* function calls either, as they would cause a switch to primary mode on my comms thread. Is using pthread_mutex_* ok to use from within the RT task (assuming it's not wrapped of course)? Or would that cause a switch back to secondary mode? If that's the case, I guess I'm stuck putting a few bits in my shared memory map to handle these instead.

That said, is memory allocated using either rt_heap_* from within the RT task ok to use as shared memory between the RT and comms tasks? Or should I still create a kernel module as you mentioned before to create this shared memory?

(06-19-2017, 08:57 AM)rvanderbijl Wrote: As I'm still using mutexes, I assume I shouldn't use the rt_mutex_* function calls either, as they would cause a switch to primary mode on my comms thread. Is using pthread_mutex_* ok to use from within the RT task (assuming it's not wrapped of course)?

Off the top of my head I am not sure. I usually avoid mutexes between RT and non-RT code because it can cause jitter on the RT side. My guess is that rt_mutex_* calls should be used for both your RT and non-RT task, but this will switch your non-RT thread to primary mode each time you call acquire() or release(). I don't know of another way to prevent mode switches other than using a shared memory approach. Typically mode switches aren't that big of a deal for non-RT threads because those threads aren't deterministic regardless. It is the RT thread that you want to ensure doesn't have any mode switches.

(06-19-2017, 08:57 AM)rvanderbijl Wrote: Is memory allocated using either rt_heap_* from within the RT task ok to use as shared memory between the RT and comms tasks? Or should I still create a kernel module as you mentioned before to create this shared memory?

Running your RT code in kernel mode is only required if you really need low jitter. In rough numbers, I've found that running RT code in kernel mode results in ~10us of jitter compared to ~200us of jitter in user mode (benchmarked on the 460EX). But you should be able to allocate shared memory in either task and share it with the other task without issue if you are using the Xenomai shm api (rt_heap_create, etc).

I swapped things around as discussed, and my TCP/IP task is now running in user space only. Events and mutexes are all going through shared memory, and I'm no longer seeing any mode switches increase on the RT Task(s) or the TCP/IP tasks after first startup.

Hoping that was the main cause of the crashes on the Delta Tau... :)

Thanks again!
That's great, glad you got it working!
Actually -- Crashes were still there. But when I was tracing them using the serial port, I noticed the crashes were happening in sshd and Delta Tau EtherCAT code. Not our code.. Some back & forth, and it looks like the memory on our Power PMAC is bad. Waiting for an RMA, and in the meantime we're on our backup board.

With no crashes at all.....

Happy that our code is running well, but not so happy that I was chasing a bad memory module for over a month. :-/
I know I'm late to this party but this is how I see the best way of handling your problem.

My assumption is that all of your threads writing to shared memory are running at the same priority level. This is easy to do with C++ because you assign them the priority you want them to have. Be careful of the PMAC facilities such as kernel tasks, which inherently run at a higher priority. I'm not sure how motion progs and "PLCs" play into the game, but in general these are not writing to anything that isn't already atomic (i.e. the 64-bit CPU writes even a double float atomically).

That said, let's say you have 5 realtime threads running at the same priority. All of them read and write to shared memory, including arrays, objects, whatever. Since they are all at the same priority, you won't have a problem with other threads preempting/interrupting the write of an array or object, for instance.

Now we add a 6th realtime thread that is going to do the work of detecting what shared memory objects changed and packing them into a format that can be shipped over the network. Instead of writing to the network you will write to a realtime FIFO pipe or message queue. So far we have no worry about thread preemption and don't require any mutex operations since all threads are at the same priority.

Now we have a thread running as a "normal" Linux thread (non-realtime; secondary mode, I believe it's called?) that sits there reading the message queue and delivering the data to the TCP connection when it is available. The message queue or realtime FIFO is designed to work this way and will guarantee that you get your data without trampling on the higher-priority threads with mutexes and other things that can be problematic in that sort of situation.

You should be able to do the same thing in reverse, that is get data over TCP to a linux thread and then pipe it through a message queue where it then gets read by the same "6th" [in my example] realtime thread and written to shared memory, possibly in between scans for changes or whatever frequency you like.

If I were to start my application from scratch I think I would go this route. I may even change it so that every data "tag" inherits from a base class that has a boolean to indicate when the data has changed so that you can scan through a list of all different types of objects and only look at one boolean to know when to update the object over the network. All write access will have to go through methods so that the "changed" bit can be set when it is changed. This can be pretty efficient. You might need some sort of throttling to handle the data that rapidly changes, but this could be as simple as the period of time you sleep in between data scans, for instance. Going from HMI to PPMAC is probably not going to need much throttling because it tends to happen at "human" speeds (button input, etc).

Our problems were solely due to a bad board from Delta Tau. Once that was remedied, the principle behind it has worked perfectly and been 100% stable: an RT C++ task in Xenomai talks to the PPMAC shared memory and to its own shared memory buffer for communications with the UI, while a second comms task (TCP/IP) running in secondary-mode Linux talks only to that shared memory buffer, using plain bits in shared memory (not events) for synchronization.

There are no issues with RT Tasks accessing the shared memory between primary and secondary mode. As long as you guard it properly using bits in that shared memory, not semaphores or events. Multiple RT tasks running asynchronously should be no issue either (as long as you don't step on data another task is working on, of course...)

I think it's a decent architecture for our purposes, and with exchange times across the wire from non-RT linux of around 50ms, not too bad. That is with changed-data checking as well. Not atomic (tag/variable), but block-based.

We still use around 25% CPU with our task, and I would love to run it on the other core, but we haven't been able to get that to run properly. It runs, but then crashes after 30 mins to an hour.
I see what you mean now, it sounds like a good solution. The 25% CPU sounds kind of high. Is this due to the delta comparisons and saving the last state operations on a large block of data?

I'm in the middle of implementing a "two part" system like this (a realtime TCP relay, so to speak) with message queues. This has been in the last week or so (it's taking a lot of rewriting in my library/api code). So far it's been running well. I'm using short text messages to synchronize data over TCP so I can deal with all the different data types without worrying about byte order and data word issues and all of that. It's not as efficient as binary transfer, but going to a change-of-state/event-based design helps out a lot, especially if you have data that isn't changing much and is "scanned" at a low rate like 50-100 ms intervals. Dealing with floating point is the most difficult (significant digits, comparisons, etc).

That's tough about the bad memory. I'm still having issues with some of the early PPC405 units we got but the newer ones seem OK and I'm not sure we have had any with bad memory yet. Knock on wood but the CK3E ARM based PPMAC is running excellent so far (fanless, in my office). ;-)
