Debugging with Adaptive Partitioning - A Better Mousetrap !

November 30, 2007 by fieldstudy

If you’ve ever tried to debug a process/task on an embedded target, you’ve probably hit this problem: the code you’re debugging goes off into some endless loop somewhere and eats all of the CPU, so that your repeated (and increasingly violent!) mouse clicks on the ‘Stop’ button are in vain. Typically, the rogue process is either higher priority than the debugger, or it’s the same priority and running in FIFO mode. The end result is usually a hard reset of the target board, which is a shame, because it’d be really handy to know where the process had got to…

Now that Adaptive Partitioning is included as part of the QNX development seat, there’s no excuse for not using it to address this problem. It’s also one of the best ways to understand and appreciate the value of time-partitioning, and spark your imagination for the possibilities that it might hold if applied to your real system.

Conceptually, adaptive partitioning is quite simple - a partition is an arbitrary group of threads (from one or more processes) that is assigned some percentage of the CPU budget. The total available CPU budget is always 100%, and the sum of the budgets of the partitions in a system always adds up to 100%.

When the system runs, the threads are scheduled using their existing priorities, exactly as they would be if adaptive partitioning were not being used. The one difference is that, as a thread runs, its running time (i.e. CPU consumption) is carefully calculated and subtracted from the current CPU budget of the partition to which it belongs. If that partition budget ever reaches 0 (i.e. the partition has consumed all of its budget), a ready thread in another partition that still has budget will preempt the currently running thread, *EVEN* if the new thread is lower priority than the currently running thread. The clever (adaptive) part is that if there are no ready threads in other partitions, then the running thread (that now has no budget) will continue to run, thus using time that those other partitions did not need.
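As a rough illustration of that rule, here is a toy shell function. It is my own sketch, not the real scheduler, and `aps_pick` is a made-up name. It assumes the running thread has the higher priority, and only answers one question: does it keep the CPU?

```shell
# Toy model of the preemption rule described above - NOT the real APS scheduler.
# Assumes the running thread outranks the other one on priority.
# Usage: aps_pick RUNNING_BUDGET_LEFT OTHER_READY(0|1) OTHER_BUDGET_LEFT
aps_pick() {
    if [ "$1" -le 0 ] && [ "$2" -eq 1 ] && [ "$3" -gt 0 ]; then
        echo other      # out of budget, and another partition can use the CPU
    else
        echo running    # still has budget, or nobody else wants the time
    fi
}

aps_pick 0 1 20    # rogue partition exhausted, debugger ready
aps_pick 0 0 20    # rogue partition exhausted, debugger idle
```

The first call prints `other` (the debugger gets in), the second prints `running` (the spare time is handed back to the busy thread).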

So, the way this applies to our debugger case is quite straightforward - we simply ensure that the debugger and its associated communications processes (in QNX, typically just io-net, soon to be io-pkt) run in their own partition that has some (usually small) CPU budget associated with it. That way, when you click on the ‘Stop’ button on your host, the message that gets sent to the target will get processed by the network stack and delivered to the debugger, which will have enough CPU to issue the right commands to manage the rogue process. When you’re not pressing the Stop button, if the process being debugged needs lots of CPU, it can exceed its own partition budget by using the debugger partition’s allotted time, but only until that partition has work to do (i.e. the ‘Stop’ button being pressed).

So, how do you build a system that does this ?

Make sure APS (the adaptive partitioning scheduler) is built into the kernel.

To do this, you need to change the build file that you use to create the boot image (known as the Initial File System or .ifs file).

This works for any target on any host, but I’ll just show it on a stock X86 self-hosted system for now, from the command line:

To add APS, log in as root, open a shell and type:

cd /boot/build

[Now, you are in the directory where the default build files live.]

As of the 6.3.2 install CD, we’ve thoughtfully added a build file all ready to do what we want - it’s called qnxbasedmaaps.build

Edit this file:

e.g. type:

vi qnxbasedmaaps.build

and you’ll see that line 12 is:

[module=aps] PATH=/proc/boot:/bin:/usr/bin:/opt/bin LD_LIBRARY_PATH=/proc/boot:/lib:/usr/lib:/lib/dll:/opt/lib procnto-instr

The APS part is just the leading:

[module=aps] - so if you add that to any other build file on the line that starts procnto-instr, that will enable APS.

This build file has some extra lines commented out that create and start up a partition, and launch qconn (the debugger agent that we care about) into its own partition. This gives the debugger process some guaranteed CPU time, even if the process being debugged goes into a while (1) ; loop at a higher priority than the debugger.

The lines are:
# Create an example scheduler partition
# Create a 20% partition named "Debugging"
#sched_aps Debugging 20

# Start qconn in the Debugging partition
#[sched_aps=Debugging]/usr/sbin/qconn

Uncomment these lines so they look like this instead:
# Create an example scheduler partition
# Create a 20% partition named "Debugging"
sched_aps Debugging 20

# Start qconn in the Debugging partition
[sched_aps=Debugging]/usr/sbin/qconn

If you want, you can change the ‘20’ to any other integer percentage of CPU that you want the debugger to have.
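If you find yourself trying different budgets, a tiny helper can print the build file line for you and reject anything that isn’t a whole number. This is just a sketch: `sched_aps_line` is a made-up name, and the 1 to 99 range check is my own assumption rather than a documented limit.

```shell
# Hypothetical helper: print the build-file line that creates a partition.
# The 1..99 range check is an assumption, not a documented limit.
# Usage: sched_aps_line NAME PERCENT
sched_aps_line() {
    case "$2" in
        ''|*[!0-9]*) echo "budget must be an integer percentage" >&2; return 1 ;;
    esac
    if [ "$2" -lt 1 ] || [ "$2" -gt 99 ]; then
        echo "budget must be between 1 and 99" >&2
        return 1
    fi
    echo "sched_aps $1 $2"
}

sched_aps_line Debugging 20
```

This prints `sched_aps Debugging 20`, ready to paste into the build file.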

I also like to print a message, if only to remind me that what I booted is a custom boot image:

e.g. I might add

display_msg "Running with a Debug APS partition"

Now we want to start the io-net process and ensure it is in the debug partition too.

You can do that in one of two ways:

1) Start it explicitly using the ‘on’ command (if you know what driver(s) and protocols you want, this is a good way)

e.g. Start io-net in the ‘Debugging’ partition, running the tcp/ip stack and the AMD Lance driver (used by Vmware)

on -X aps=Debugging io-net -ptcpip -dlance

2) If you just want it to run with the default diskboot, it’s a bit more complicated.

First you need to create a small executable file:

Here’s the source code - I called it startnet.c

All it does is run the ‘on’ program, which then starts the io-net process in the ‘Debugging’ partition, and then waits for io-net to start, before moving itself into the background and waiting for 60 seconds.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/procmgr.h>

// start io-net for enumeration purposes, putting it into a partition

int main(int argc, char *argv[])
{
    // launch io-net in the 'Debugging' partition, then wait for it to appear
    system("/bin/on -X aps=Debugging /sbin/io-net -ptcpip");
    system("/bin/waitfor /dev/io-net");

    // move to background
    procmgr_daemon(0, 0);

    sleep(60); // wait for 1 minute
    return 0;
}

Compile it with:

qcc startnet.c -o startnet

and then copy it to /sbin

cp startnet /sbin/startnet

Then you need to edit the file:

/etc/system/enum/include/net

This file contains the command to use to start io-net. By default it is:

#
# macro definitions for network
#

all
set(IONET_CMD, io-net -ptcpip)

If you change this to be:

#
# macro definitions for network
#

all
set(IONET_CMD, "startnet")

then the startnet program will get started instead, when the enumerator wants to start a new network interface.

This slightly convoluted operation is required because every time that the enumerator code decides to start a new network interface, it first looks to see if the program pointed to by IONET_CMD is running and starts it if it is not. If it is already running, then instead of starting a new instance, it uses ‘mount -Tio-net’ to add the new interface to the existing io-net process. If startnet did not wait around, then it would get started again, and start up multiple io-net processes, which is not what we want.
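To see why the waiting matters, here is a toy model of that behaviour in shell. It reflects my reading of the description above, not actual enumerator code; `ionet_launches` is a made-up name, and it assumes all interfaces are enumerated while startnet is still sleeping.

```shell
# Toy model of the enumerator behaviour described above (not QNX code).
# Prints how many times io-net would be launched.
# Usage: ionet_launches INTERFACES LINGERS(0|1)
ionet_launches() {
    launches=0
    cmd_running=0
    i=0
    while [ "$i" -lt "$1" ]; do
        if [ "$cmd_running" -eq 0 ]; then
            launches=$((launches + 1))   # IONET_CMD not seen running: start it
            cmd_running=$2               # startnet stays visible only if it lingers
        fi                               # else: mount -Tio-net onto the existing io-net
        i=$((i + 1))
    done
    echo "$launches"
}

ionet_launches 3 1    # startnet waits around
ionet_launches 3 0    # startnet exits straight away
```

With three interfaces, the lingering version prints `1` and the impatient version prints `3`, which is exactly the pile of io-net processes we are trying to avoid.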

The final thing you need to do is to move the line in the build file that creates the partition, so that it is created before io-net gets started by diskboot.

So, the relevant build file lines become:

[+script] startup-script = {
# To save memory make everyone use the libc in the boot image!
# For speed (less symbolic lookups) we point to libc.so.2 instead of lib
procmgr_symlink ../../proc/boot/libc.so.2 /usr/lib/ldqnx.so.2

# Create an example scheduler partition
# Create a 20% partition named "Debugging"
sched_aps Debugging 20

# Default user programs to priority 10, other scheduler (pri=10o)
# Tell "diskboot" this is a hard disk boot (-b1)
# Tell "diskboot" to use DMA on IDE drives (-D1)
# Start 4 text consoles by passing "-n4" to "devc-con" (-o)
# By adding "-e" linux ext2 filesystem will be mounted as well.
[pri=10o] PATH=/proc/boot diskboot -b1 -D1 -odevc-con,-n4
display_msg "Running with a debug APS partition"

# Start qconn in the Debugging partition
[sched_aps=Debugging]/usr/sbin/qconn
}

If you are running an SMP (aka multicore) system, you can change line 12 to use procnto-smp-instr instead of the default procnto-instr to make this run SMP.

Now, save the modified build file.

To create the boot image from this build file, run:

mkifs qnxbasedmaaps.build <filename>

You can write the image to a separate file and copy it over the boot image later, or you can write it directly over the main boot image.

e.g.

mkifs qnxbasedmaaps.build debugaps.ifs

This creates a bootable image called debugaps.ifs that you can use at a later date, but does not affect this machine’s default boot image.

or

mkifs qnxbasedmaaps.build /.boot

will change the default boot image to be what is specified in qnxbasedmaaps.build

If you perform this latter operation, rebooting should see your image boot and run.

Once you’ve rebooted, if all is well, everything should appear to be the same as before, but if you log in and run ‘aps’, you should see output similar to:

# aps

                        +---- CPU Time ---+-- Critical Time --
Partition name   id     | Budget | Used   | Budget |      Used
------------------------+-----------------+--------------------
System      0           |  80%   | 19.70% |  200ms |   0.000ms
Debugging   1           |  20%   |  0.17% |    0ms |   0.000ms
------------------------+-----------------+-------------------
Total                   |   100% | 19.88% |
#

This shows that APS is running and that we have 2 partitions - the default ‘System’ partition, and the one we created called ‘Debugging’

If you want to know which APS partition each process/thread is assigned to, you can run:

pidin sched

or for a given process

pidin -p <process name> sched

e.g. to check that qconn is in the Debugging partition:

# pidin -p qconn sched
pid       tid name                   prio cpu ExtSched    STATE
196625  1 usr/sbin/qconn 10r  0     Debugging  SIGWAITINFO
196625  2 usr/sbin/qconn 10r  0     Debugging  CONDVAR
196625  3 usr/sbin/qconn 10r  1     Debugging  RECEIVE
196625  4 usr/sbin/qconn 10r  0     Debugging  RECEIVE
#
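If you’d rather not eyeball the output, a small awk filter can do the check. This is only a sketch: `threads_in_partition` is a made-up name, it assumes awk is available on the target, and it assumes the column layout shown above (partition name in the sixth column, no spaces in the process name).

```shell
# Succeeds only if every thread line on stdin lists partition $1 in the
# ExtSched column. Assumes the 'pidin sched' layout shown above.
# Intended use on the target: pidin -p qconn sched | threads_in_partition Debugging
threads_in_partition() {
    awk -v want="$1" '
        $1 ~ /^[0-9]+$/ { n++; if ($6 != want) bad++ }   # skip the header line
        END { exit (n == 0 || bad > 0) }                 # fail if nothing matched
    '
}
```

The function says nothing on success; check its exit status (for example with `&& echo OK`).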

And there we have it.

Now, when you run the debugger from Momentics, the program being debugged will not stop the debugger and network stack from running (because they will get their partition’s CPU quota made available to them), and you should be able to stop even the most renegade of high priority, badly behaved, CPU-gobbling processes!

My Device Isn’t Supported : Is There A Quick Fix ?

September 6, 2007 by fieldstudy

As up to date and as diligent as any OS company tries to be with its software driver support, the rate of change of hardware, especially variants of existing devices, outpaces anything that we can reasonably keep up with. So, what’s a developer to do when they buy a shiny new motherboard and QNX Neutrino doesn’t detect that new GigE port?

I get asked this type of question at least once a week. Quite often, we’ve added support for the new device in our source tree, but haven’t released the new version of the driver yet. We do regular releases for sure, but not daily…

Well, there are already a bunch of drivers delivered as part of Neutrino, so it is quite possible that we *do* have a driver that would work, if it only ‘knew’ that the new device is just like one it already recognizes. There are a couple of easy ways to tell a driver about a new device, so here’s what to try if you hit this issue… No guarantees of course, but it can save a lot of heartburn.

Find the PCI IDs

Assuming that the new device is a PCI (or PCI-X or PCI-Express) device, which is a fair bet these days, it has two associated 16-bit values, called its vendor ID and device ID, that together identify the device. It is these numbers that Neutrino uses to decide which driver to use to manage the device. Each driver has a (possibly extensive) list of devices that it knows are for it to manage, but it is also possible to pass a specific vendor/device ID pair to a driver and ask it to try to manage the device.

There are several databases that keep lists of these IDs

e.g. http://pcidatabase.com/

and QNX keeps a list of supported devices here:

http://www.qnx.com/developers/hardware_support/index.html

So, if a device isn’t running and you think a specific driver should work with it, the first thing to do is to find its PCI IDs. To do that, on a booted Neutrino system, in a shell run ‘pci -v | more’ to get output like this:

PCI version    = 2.10

Class          = Bridge (Host/PCI)
Vendor ID      = 8086h, Intel Corporation
Device ID      = 7190h,  440BX/ZX/DX - 82443BX/ZX/DX Host bridge
PCI index      = 0h
Class Codes    = 060000h
Revision ID    = 1h
Bus number     = 0
Device number  = 0
Function num   = 0
Status Reg     = 210h
Command Reg    = 6h
Header type    = 0h Single-function
BIST           = 0h Build-in-self-test not supported
Latency Timer  = 0h
Cache Line Size= 0h
Subsystem Vendor ID = 15adh
Subsystem ID        = 1976h
Max Lat        = 0ns
Min Gnt        = 0ns
PCI Int Pin    = NC
Interrupt line = 0
Class          = Bridge (PCI/PCI)
Vendor ID      = 8086h, Intel Corporation
Device ID      = 7191h,  440BX/ZX/DX - 82443BX/ZX/DX AGP bridge
PCI index      = 0h
Class Codes    = 060400h
Revision ID    = 1h
Bus number     = 0
Device number  = 1
Function num   = 0
Status Reg     = 220h
Command Reg    = 11fh
Header type    = 1h Single-function
BIST           = 0h Build-in-self-test not supported
Latency Timer  = 0h
Cache Line Size= 0h
Primary Bus Number       = 0h
Secondary Bus Number     = 1h
Subordinate Bus Number   = 1h
Secondary Latency Timer  = 40h
I/O Base                 = f0h
I/O Limit                = 0h
Secondary Status         = 2a0h
Memory Base              = fff0h
Memory Limit             = 0h
Prefetchable Memory Base = fff0h
Prefetchable Memory Limit= 0h
Prefetchable Base Upper 32 Bits  = 0h
Prefetchable Limit Upper 32 Bits = 0h
I/O Base Upper 16 Bits   = 0h
I/O Limit Upper 16 Bits  = 0h
Bridge Control           = 80h
PCI Int Pin              = NC
Interrupt line           = 0
CPU Interrupt            = 0h
[Snip - this list can be *long* for a complex system]
Class          = Network (Ethernet)
Vendor ID      = 1022h, Advanced Micro Devices [AMD]
Device ID      = 2000h,  79c970 [PCnet32 LANCE]
PCI index      = 0h
Class Codes    = 020000h
Revision ID    = 10h
Bus number     = 0
Device number  = 17
Function num   = 0
Status Reg     = 280h
Command Reg    = 7h
Header type    = 0h Single-function
BIST           = 0h Build-in-self-test not supported
Latency Timer  = 40h
Cache Line Size= 0h
PCI IO Address  = 1080h length 128 enabled
Subsystem Vendor ID = 1022h
Subsystem ID        = 2000h
PCI Expansion ROM = 0h length 65536 disabled
Max Lat        = 255ns
Min Gnt        = 6ns
PCI Int Pin    = INT A
Interrupt line = 9
CPU Interrupt  = 9h

Find the device that you’re interested in and record the vendor and device PCI IDs that are listed for it.
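For a long listing, you can let awk pick out the network devices for you. This is a sketch under a few assumptions: `pci_net_ids` is a made-up name, awk is available on your target, and the output format matches the listing above (a Class line, then Vendor ID and Device ID lines with a trailing ‘h’).

```shell
# Print "vendor device" (hex, without the trailing h) for every device whose
# Class line mentions Network. Reads 'pci -v' style text on stdin.
# Intended use on the target: pci -v | pci_net_ids
pci_net_ids() {
    awk '
        /^Class[ \t]*=/     { net = ($0 ~ /Network/) }
        /^Vendor ID[ \t]*=/ { v = $4; sub(/h,?$/, "", v) }
        /^Device ID[ \t]*=/ { d = $4; sub(/h,?$/, "", d); if (net) print v, d }
    '
}
```

Fed the listing above, it prints `1022 2000`, the IDs of the AMD PCnet device.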

Then choose one of the 2 methods below to get the driver going…

Quick command line test

Drivers in QNX Neutrino can be stopped/killed/restarted independently, so one thing to do is to kill and restart the driver that you think is the right one to use, passing the PCI IDs to it as command line parameters:

For instance, there are *many* variants of the Intel Gigabit Ethernet device that we support via our devn-i82544.so device driver. Happily, our driver generally supports the new variants very nicely.

So, a simple test is to run:

slay io-net

[slay is a 'kill by process name' executable that will end the io-net network manager process, which runs network device drivers.]

Then, restart io-net, specifying the i82544 driver and the specific ID that needs to be provided:

io-net -ptcpip -di82544 vid=<VendorID>,did=<DeviceID>

For this driver, the Vendor ID is typically 0x8086, since this is Intel’s main vendor ID - very droll.

The ‘-ptcpip’ part means ‘run the TCP/IP protocol stack’. You might add ‘-pqnet’ too if you want to run QNET.

If this works, your new device will be created in /dev/io-net, and running ‘ifconfig -a’ will show your new device or devices.
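To avoid typos when filling in the IDs, a helper like the one below can build the command line for you. It is a sketch only: `ionet_cmd` is a made-up name, it hard-codes the i82544 driver from the example above, and it just prints the command so you can inspect it before running it.

```shell
# Hypothetical helper: turn IDs as printed by 'pci -v' (e.g. 8086h) into the
# io-net command line shown above. It only prints the command.
# Usage: ionet_cmd VENDOR_ID DEVICE_ID
ionet_cmd() {
    vid=${1%h}    # drop the trailing h, if any
    did=${2%h}
    echo "io-net -ptcpip -di82544 vid=0x$vid,did=0x$did"
}

ionet_cmd 8086h 1099h
```

This prints `io-net -ptcpip -di82544 vid=0x8086,did=0x1099`; the 1099 device ID is just an example value.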

Make The Change Permanent

So, if you tried the command line option, and things worked well, you probably want to make the change permanent, so the driver starts up properly every time you reboot.

If you make your own custom boot image, you can just embed the io-net command from above into your build file, rebuild and you’re done.

More often, on X86 boxes, customers tend to use our diskboot process to manage driver startup. This uses a set of tables in /etc/system/enum/devices - one for audio devices, one for net(work) devices etc.

These tables look a bit confusing at first, but adding a new device is pretty simple - you just copy an existing line or lines for the driver you want to work with, modify it for your IDs, then reboot to see if it works…

e.g. to add a line for the Intel Gigabit Ethernet driver, you edit (e.g. with ‘vi’ ) /etc/system/enum/devices/net

and search for 82544

You’ll see a bunch of lines like this:

device(pci, ven=$(PCI_VEND_INTEL), dev=1008)    # Intel 82544EI Gigabit
device(pci, ven=$(PCI_VEND_INTEL), dev=1009)    # Intel 82544EI Gigabit
device(pci, ven=$(PCI_VEND_INTEL), dev=100c)    # Intel 82544GC Gigabit

… lots more similar lines

Ending with:

tag(devn)
 append(legacy, ",nonet")
 requires($(IONET_CMD),)
 uniq(netnum, devn-en, 0)
 mount(-Tio-net "-opci=$(index),vid=0x$(ven),did=0x$(dev)" /lib/dll/devn-i82544.so, "/dev/io-net/en$(netnum)")
 use(symbolic=netmgr)

This last set of lines is the magic that launches the i82544 driver with the correct vendor and device IDs.

All you need to do to add a new device to this list is copy one of the device lines and modify its dev=<xxxx> value.

e.g.

device(pci, ven=$(PCI_VEND_INTEL), dev=1099)

where 0x1099 is the new device’s PCI device ID.
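If you have several IDs to add, you could generate the lines rather than type them. A minimal sketch, assuming an Intel device; `enum_device_line` is a made-up name, and the hex check is just there to catch typos.

```shell
# Hypothetical helper: print an enumerator device line for a new Intel device ID.
# Single quotes keep $(PCI_VEND_INTEL) literal, so the shell does not expand it.
# Usage: enum_device_line HEX_DEVICE_ID
enum_device_line() {
    case "$1" in
        ''|*[!0-9a-fA-F]*) echo "expected a hex device ID such as 1099" >&2; return 1 ;;
    esac
    printf 'device(pci, ven=$(PCI_VEND_INTEL), dev=%s)\n' "$1"
}

enum_device_line 1099
```

The printed line goes in alongside the existing device lines for the same driver, above the tag(devn) block.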

Save the file, and then reboot (because diskboot only runs at startup) and hopefully your device will show up…

If it does, great - you’re done, and can get back to your real job of getting your product built and tested.

If it doesn’t, let QNX know - it really helps us to know what you need !