Reducing friction in balena device support with our heuristic diagnostics parser

Every six months or so, our company takes a break from most of our normal activities to engage in a “Hack Week,” a week of working on a project completely outside of our day-to-day tasks. People form small teams and work on projects around a common theme, and embark on a friendly competition for bragging rights, swag, and the potential to finalize their product for the masses to use.

This year the theme was “Products for product builders” and my team worked on something called a “Device Diagnostics Parser.” We’re really proud of our work, and I wanted to dig in and explain our thinking, both to show off our product and to give a sneak peek at life at balena.

EDITOR’S NOTE: Hack Week 2022 was a week where balenistas formed teams to create products and projects that improve the experience of using balena. This year, our theme was “making products for product builders.” In other words, what could we create to support one another in building the next great balena product? The team behind this post won our popular vote competition for best Hack Week product. Give the team a round of applause! 🙂

Building a product to support our support agents

At balena, we follow a support-driven development approach, and every balenista is scheduled for support work. Support is one of our major sources of feedback, from which we derive improvements and new features. Often, diagnostics files lead us to fix issues or understand the direction for an improvement.

Manually reading through these diagnostics files is not very efficient: it’s easy to miss problems and to get hung up on messages that are not necessarily related to the issue at hand. The contents of the file are the output of a series of shell commands executed on the device, and the file is typically 10,000 lines or more in length!

The goal of our Hack Week project was to try extracting patterns from a device’s diagnostics output file automatically. We found ways to structure diagnostics files and built a heuristics parser that matches pattern heuristics against diagnostics inputs. Finally, we were able to report the matches back into our support and knowledge tool.

“Wait, what’s a pattern?”

Definition: A pattern is a behavior that we’ve observed from our internal and external users.

An example of a loop at balena

At balena we define patterns as symptoms which are collected and documented in our collective sensemaking database called Jellyfish. Whenever our support agents come across a new issue in a support ticket they create a new pattern or link the ticket to an existing pattern. Over time, this process allows us to see which patterns are occurring more often and require closer attention.

We turn support threads (and any company observations and insights) into patterns

When we receive feedback from customers (in the form of a support ticket or otherwise), we gather as much information as possible and generate one or more “patterns” which are akin to “problem statements.” We can then look through our patterns to consider ways that the product can be improved by resolving the root cause the pattern describes. We aptly call these “improvements” which could also be considered “solution statements.” Well-formed improvements can solve multiple patterns at once, while an improvement that solves no patterns is avoided.

We link feedback streams, patterns, and improvements together in Jellyfish to form a feedback loop. Because of this loop, we are able to inform the feedback sender about our solution when an improvement is completed. This system also lets us see when a pattern is linked to more than one (or many) feedback streams, which carries more weight in our internal decision-making, including how we improve existing features for products or create new ones as well.

Where device diagnostics come into play

Often, when a device is not working properly, we receive feedback on our product when users raise support tickets. For devices, we can generate device diagnostics that capture the output of a given set of shell commands executed on the device’s hostOS. Device diagnostics help us match existing patterns or generate new ones, and understand root causes.

The process to gather device diagnostics is simple and can be triggered by any balenaCloud user or support agent through our dashboard via the Diagnostics tab. Device diagnostics are collected as one compound text file containing all outputs from the executed commands.

This file is regularly around 10,000 lines of text, and currently support agents have to review these files manually. The diagnostics often show specific heuristics that we use to search for existing patterns or to document new patterns for future use. In addition, these heuristics are often version-specific, as we continuously develop improvements to balenaOS, balena Supervisor, or balenaEngine and implement solutions to patterns.

Addressing some known friction

Following are some areas of friction we have identified for our support work:

Finding patterns in our knowledge base by heuristics
Linking existing patterns to support threads
Understanding, searching for, and linking patterns that are already solved (version-dependent fixes) for older software versions on devices that have not been updated
Formalizing knowledge and documenting for structured and automated usage
Communicating consistently about patterns to the feedback sender

The friction in these tasks could be reduced by automating the pattern searching, matching, and linking. One source of semi-structured input is device diagnostics, so an automated way to read, parse, and match patterns in device diagnostics is worth considering.

Hacking on the device diagnostics parser

Text file tokenizer / translator

We already have a device diagnostics solution, which is accessible from the balenaCloud dashboard. It is a collection of shell commands that get executed via an SSH tunnel to the balenaOS host.

We translate the device diagnostics input text file into a structured JSON file. The output of each command executed during the diagnostics run can then be addressed by its command string.

Input

```bash
--- cat /etc/os-release ---

2022-05-02 16:03:56.535779868+00:00
ID="balena-os"
NAME="balenaOS"
VERSION="2.87.16+rev1"
VERSION_ID="2.87.16+rev1"
PRETTY_NAME="balenaOS 2.87.16+rev1"
MACHINE="raspberrypi0-2w-64"
META_BALENA_VERSION="2.87.16"
BALENA_BOARD_REV="4c88de7"
META_BALENA_REV="95b3bf9b"
SLUG="raspberrypi0-2w-64"
real 0m 0.01s
user 0m 0.00s
sys 0m 0.01s
```

Output

```json
{
  ...
  "cat /etc/os-release": {
    "category": "OS",
    "command": "cat /etc/os-release",
    "time": "2022-05-02 16:03:56.535779868+00:00",
    "stdout": "ID=\"balena-os\"\nNAME=\"balenaOS\"\nVERSION=\"2.87.16+rev1\"\nVERSION_ID=\"2.87.16+rev1\"\nPRETTY_NAME=\"balenaOS 2.87.16+rev1\"\nMACHINE=\"raspberrypi0-2w-64\"\nMETA_BALENA_VERSION=\"2.87.16\"\nBALENA_BOARD_REV=\"4c88de7\"\nMETA_BALENA_REV=\"95b3bf9b\"\nSLUG=\"raspberrypi0-2w-64\"",
    "real": "0m 0.01s",
    "user": "0m 0.00s",
    "sys": "0m 0.01s"
  },
  ...
}
```
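To make the translation step more concrete, here is a minimal TypeScript sketch of such a tokenizer. It is not our actual implementation: the `parseDiagnostics` function, the `CommandResult` interface, and the three regular expressions are illustrative assumptions. It also assumes that every section starts with a header of the form `--- <command> ---`, followed by a timestamp, the command output, and the three timing lines. The `category` field from the output above is left out, because it would need a separate command-to-category mapping.

```typescript
// Hypothetical shape of one parsed section; the field names mirror the
// example output above, but the real parser may differ.
interface CommandResult {
  command: string;
  time: string;
  stdout: string;
  real: string;
  user: string;
  sys: string;
}

// Assumed line formats (not taken from the real parser).
const HEADER = /^--- (.+) ---$/;
const TIMESTAMP = /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}/;
const TIMING = /^(real|user|sys)\s+(\d+m\s*[\d.]+s)$/;

function parseDiagnostics(text: string): Record<string, CommandResult> {
  const sections: Record<string, CommandResult> = {};
  let current: CommandResult | undefined;

  for (const rawLine of text.split("\n")) {
    const line = rawLine.replace(/\s+$/, ""); // strip trailing whitespace
    const header = HEADER.exec(line);
    if (header) {
      // A new header starts a new section, addressed by its command string.
      current = { command: header[1] ?? "", time: "", stdout: "", real: "", user: "", sys: "" };
      sections[current.command] = current;
      continue;
    }
    // Simplification: text before the first header and blank lines are skipped.
    if (!current || line === "") continue;

    const timing = TIMING.exec(line);
    if (timing) {
      const key = timing[1] as "real" | "user" | "sys";
      current[key] = timing[2] ?? "";
    } else if (current.time === "" && TIMESTAMP.test(line)) {
      current.time = line; // first timestamp-like line after the header
    } else {
      current.stdout += (current.stdout ? "\n" : "") + line;
    }
  }
  return sections;
}

// Usage with a shortened version of the input shown above.
const sampleInput = [
  "--- cat /etc/os-release ---",
  "",
  "2022-05-02 16:03:56.535779868+00:00",
  'ID="balena-os"',
  'NAME="balenaOS"',
  "real 0m 0.01s",
  "user 0m 0.00s",
  "sys 0m 0.01s",
].join("\n");

const parsedSample = parseDiagnostics(sampleInput);
```

Keying the result by command string is what lets a heuristic later address exactly the output it cares about, for example `parsedSample["cat /etc/os-release"].stdout`.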

Heuristic Matcher / Parser

This structured JSON is fed into the pattern matcher, which checks a finite list of heuristics against selected command outputs.

Heuristic example

```json
{
  "type": "object",
  "title": "Engine storage migration timing out.",
  "description": "The device tried to migrate the Engine storage from aufs to overlay2, but the migration took more than the initialization time out, so Systemd killed the Engine during the migration. Fixed in balenaOS v2.98.4.",
  "properties": {
    "permalinkPattern": "https://jel.ly.fish/pattern-container-images-redownloaded-hup-engine-killed-due-timeout-middle-migration-358ef91",
    "recentStorageMigration": {
      "allOf": [
        {
          "description": "We have recently done a storage migration",
          "$$formula": "/Storage migration from aufs to overlay2 starting/.test(contract['journalctl --no-pager --no-hostname -n 1000 -at balenad'].stdout)"
        },
        {
          "description": "balenaEngine timed out during initialization",
          "$$formula": "/balena.service: start operation timed out/.test(contract['journalctl --no-pager --no-hostname -pwarning -perr -a'].stdout)"
        },
        {
          "description": "OS version is less than v2.98.4",
          "$$formula": "SEMVERCOMP(REGEXEXTRACT(contract['cat /etc/os-release'].stdout, 'balenaOS ([0-9]+.[0-9]+.[0-9]+)'), '2.98.4', '<')"
        }
      ]
    }
  }
}
```

The heuristics need to be defined as searches: either regular expressions or plain JavaScript functions. The parser runs each heuristic over selected output fields of the structured diagnostics file. Most of the heuristics are string operations, but the parser itself is not limited to string matches and can work on any kind of input.
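The “regular expression or plain function” idea can be sketched in a few lines of TypeScript. This is only an illustration under assumed names: the `Check` and `Heuristic` interfaces and the `runCheck` and `matchHeuristics` functions are hypothetical and are not the Jellyscript-based matcher we actually used. The sample journal lines are taken from the heuristic example above.

```typescript
// Hypothetical types for illustration only.
interface CommandOutput {
  stdout: string;
}
type Diagnostics = Record<string, CommandOutput>;

interface Check {
  description: string;
  command: string; // which command output to inspect
  // Either a regular expression (without the "g" flag, so that test() is
  // stateless) or a plain predicate function.
  test: RegExp | ((stdout: string) => boolean);
}

interface Heuristic {
  title: string;
  allOf: Check[]; // every check must pass for the heuristic to match
}

function runCheck(check: Check, diagnostics: Diagnostics): boolean {
  const output = diagnostics[check.command];
  if (!output) return false; // command missing from this diagnostics run
  const test = check.test;
  return test instanceof RegExp ? test.test(output.stdout) : test(output.stdout);
}

function matchHeuristics(heuristics: Heuristic[], diagnostics: Diagnostics): string[] {
  return heuristics
    .filter((h) => h.allOf.every((c) => runCheck(c, diagnostics)))
    .map((h) => h.title);
}

// Usage: one heuristic with a regex check and a function check.
const engineMigrationTimeout: Heuristic = {
  title: "Engine storage migration timing out.",
  allOf: [
    {
      description: "We have recently done a storage migration",
      command: "journalctl --no-pager --no-hostname -n 1000 -at balenad",
      test: /Storage migration from aufs to overlay2 starting/,
    },
    {
      description: "balenaEngine timed out during initialization",
      command: "journalctl --no-pager --no-hostname -pwarning -perr -a",
      test: (stdout: string) =>
        stdout.indexOf("balena.service: start operation timed out") !== -1,
    },
  ],
};

const sampleDiagnostics: Diagnostics = {
  "journalctl --no-pager --no-hostname -n 1000 -at balenad": {
    stdout: "Storage migration from aufs to overlay2 starting",
  },
  "journalctl --no-pager --no-hostname -pwarning -perr -a": {
    stdout: "balena.service: start operation timed out",
  },
};

const matchedTitles = matchHeuristics([engineMigrationTimeout], sampleDiagnostics);
```

Treating a missing command output as “no match” keeps the sketch tolerant of diagnostics files produced by older versions that run a different set of commands.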

Moreover, the parser can also extract information from the input and provide it to the next step. The Jellyscript parser uses formulajs syntax and JavaScript syntax, and is extendable with custom comparison functions. Because heuristics are often version-specific, we need to check the particular version of a component. Therefore, we built a semantic version comparator into the parser, which makes it easier to formalize heuristics and to separate those for older versions from those for newer ones.
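A semantic version comparison of this kind can be sketched as follows. This is a simplified stand-in and not the comparator built into the parser: `semverComp`, `compareVersions`, and `parseVersion` are hypothetical names, and the sketch only looks at `major.minor.patch`, ignoring suffixes such as `+rev1` and any pre-release ordering rules.

```typescript
type Operator = "<" | "<=" | "==" | ">=" | ">";

// Extract the first "major.minor.patch" triple from a version string.
function parseVersion(version: string): [number, number, number] {
  const match = /(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (!match) throw new Error("Not a semantic version: " + version);
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Negative if a < b, zero if equal, positive if a > b.
function compareVersions(a: string, b: string): number {
  const [aMajor, aMinor, aPatch] = parseVersion(a);
  const [bMajor, bMinor, bPatch] = parseVersion(b);
  // "||" falls through to the next component only when the difference is 0.
  return aMajor - bMajor || aMinor - bMinor || aPatch - bPatch;
}

function semverComp(a: string, b: string, op: Operator): boolean {
  const diff = compareVersions(a, b);
  switch (op) {
    case "<":
      return diff < 0;
    case "<=":
      return diff <= 0;
    case "==":
      return diff === 0;
    case ">=":
      return diff >= 0;
    case ">":
      return diff > 0;
  }
}

// Usage: the OS version from the diagnostics example is older than v2.98.4.
const affectedByMigrationBug = semverComp("2.87.16+rev1", "2.98.4", "<"); // true
// Components are compared numerically, so 2.100.0 is newer than 2.98.4.
const newerRelease = semverComp("2.100.0", "2.98.4", ">"); // true
```

Comparing the components numerically rather than as strings matters here, because a plain string comparison would sort `2.100.0` before `2.98.4`.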

Output of heuristic matcher

```json
{
  ...
  "results": {
    "engineStorageMigrationTimeout.ts": {
      "recentStorageMigration": {
        "value": true
      },
      "title": "Engine storage migration timing out.",
      "description": "The device tried to migrate the Engine storage from aufs to overlay2, but the migration took more than the initialization time out, so Systemd killed the Engine during the migration. Fixed in balenaOS v2.98.4.",
      "permalinkPattern": "https://jel.ly.fish/pattern-container-images-redownloaded-hup-engine-killed-due-timeout-middle-migration-358ef91"
    }
  }
  ...
}
```

Hack Week vibes

Discussions during Hack Week

For us as a hack team, it was super valuable to discuss our approaches and see how our solutions may fit into our systems in the future. A couple of our discussions are worth sharing.

We discussed how to automatically find new patterns or add heuristics to existing patterns, and came up with ideas about training an extraction AI on all the diagnostics files that we have in Jellyfish, combined with the existing patterns on the support threads. We also discussed a more abstract and generic set of heuristics, e.g., every line with “error” in it may be a potential heuristic. After the existing heuristics have been checked, the parser would run again with these generic heuristics and suggest newly found ones. This still involves a manual process to translate the generic findings into real heuristics and connect them to patterns.
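The generic “every line with error in it” idea could look roughly like the TypeScript sketch below. It is purely illustrative: the `suggestCandidates` function, the `Candidate` interface, and the sample log lines are made up for this example and are not part of what we built during Hack Week.

```typescript
// Hypothetical result type: a line that might deserve its own heuristic.
interface Candidate {
  command: string;
  line: string;
}

function suggestCandidates(
  diagnostics: Record<string, { stdout: string }>,
  knownHeuristics: RegExp[], // regexes of heuristics that already exist
): Candidate[] {
  const candidates: Candidate[] = [];
  for (const command of Object.keys(diagnostics)) {
    const output = diagnostics[command];
    if (!output) continue;
    for (const line of output.stdout.split("\n")) {
      const looksLikeError = /error/i.test(line);
      const alreadyCovered = knownHeuristics.some((re) => re.test(line));
      // Only suggest error lines that no existing heuristic explains.
      if (looksLikeError && !alreadyCovered) {
        candidates.push({ command, line: line.trim() });
      }
    }
  }
  return candidates;
}

// Usage with made-up log lines.
const exampleDiagnostics = {
  "example command": {
    stdout: [
      "service-a: error while starting",
      "service-b: connection error",
      "service-c: started successfully",
    ].join("\n"),
  },
};

const suggestions = suggestCandidates(exampleDiagnostics, [/error while starting/]);
```

The suggestions are only raw material: a support agent would still have to decide which of them are real heuristics and link them to a pattern.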

As this automated diagnostics pipeline is a potentially valuable feature for our users, we discussed how we could surface it to them. Ideas even floated around a self-healing feature, which would integrate the pattern parser into the dashboard and our balena API to provide solution steps for found patterns. The user could then decide to apply the solution steps automatically.

One step further would be predictively opening support threads, which would involve running diagnostics every now and then and executing the checker on the device. Each device configured to automatically check itself would report any matched heuristics and automatically generate a support thread.

A more forward-looking thought behind this is that we may want to fix a device before the customer even notices a problem, provided the fix doesn’t interrupt production, or whatever is in production is already broken and the fix will recover it without added harm. A self-reporting device would also automatically increase the pattern weight, which would help us understand the distribution of a given pattern.

From an open source and community perspective, we thought about making this solution publicly available. Some ideas circulated around containerizing the solution, including the knowledge base, and shipping it to any balenaOS device so that the device can run it on its own.

Working closely together, 100% remote

To work together we used Teamflow, a digital space where rooms, listening groups, and any kind of shared digital resources are easily accessible. Using Teamflow over the week was a good experience and very useful for synchronizing on the project. It felt more intuitive to have a room where you can find your team colleagues than to schedule meetings or ping people in a traditional company chat to check if someone is around.

The built-in screen sharing in Teamflow helped a lot to mock up and discuss interfaces and implementation details. Working as an async team whose members also had a lot of other duties during Hack Week still worked out, and the time we spent together was focussed but also highly social!

Working in the team helped us find existing solutions and adapt them to our implementation. In particular, without the presence of a long-time balenista, we would most probably not have used Jellyscript as the underlying parser for the heuristic matcher, which reduced the amount of work drastically.

Outcomes and next steps

Hack Week showed that there is an urgent need to automate pattern detection and matching in support work. As every cool project needs a cool name, we created the Sherlock Pattern Initiative. Now we need to find a way to actually make it part of our product for product builders, namely Jellyfish.

What’s next?

Our next milestone is to turn the Hack Week implementation into a usable one, as a contribution to the internal Jellyfish ecosystem that helps all of our support agents. For our users out there who work with our support agents to debug device issues, maybe you’ll see the Device Diagnostics Parser at work soon enough. Until then, back to hacking!

If you have any questions about Hack Week or what we built, let us know in the comments!