Meshtastic – Issues I’ve Found

18th May 2024 richard

So, this is a list mostly for my own reference, but for our local mesh too. Some of these I’ve raised, some have been raised by others, some have been raised, fixed and regressed, some I’ve been told are me being stupid and, in one case, I was told “Meshtastic is not suitable for battery or solar use”.

I’ve described each issue as best I can and divided them up into “irritants” and “showstoppers”. I’ll probably submit these as I tidy up my findings. These are just my opinions though.

WHAT: ADC calibration in Meshtastic Web is just broken. Only allows whole numbers.
HOW: Using the web UI, try to change the ADC Multiplier Override ratio. If you add a decimal point it is deleted, meaning only whole numbers are allowed.
IMPACT: It is impossible to calibrate the ADC from the web UI. This is a very important step.
COMMENT: How has this been missed? Reports of this bug seem to go back many months. Misconfiguration of the ADC multiplier can result in flash corruption. This is basic UI stuff!

WHAT: ADC calibration in the Android app. UI “fights” user input
HOW: In the Android app go to “Radio Configuration” and “Power”. Under “ADC Multiplier Override ratio”, delete the contents and attempt to enter a floating point number, e.g. 5.75. The UI will force 5.0 when 5 is pressed, the next digit will be ignored and a 0 added, giving a ratio of 50.0.
IMPACT: Irritating. It makes the job harder and, as no range checking is done (but the braindead input checking is), it will result in an ADC value being set that breaks the battery monitoring.
COMMENT: Again, how has this been missed? I can’t find any reports of this issue with a quick search but it has been present in all versions of the app I’ve used.
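To illustrate the kind of input handling I mean for both UIs, here is a minimal sketch in Python. It is not the app’s or the web client’s actual code. The function name and the 2.0 to 6.0 limits are placeholders of mine for illustration, so check the firmware docs for the real range. The idea is to leave the text alone while it is being typed, parse it once on submit, and range check the result.

```python
from typing import Optional

# Hypothetical limits for illustration only; not taken from the firmware.
MIN_MULTIPLIER = 2.0
MAX_MULTIPLIER = 6.0


def parse_adc_multiplier(text: str) -> Optional[float]:
    """Parse the finished text of the input field once, on submit.

    Returns the value as a float, or None if the text is not a number
    or falls outside the allowed range.
    """
    try:
        value = float(text.strip())
    except ValueError:
        return None
    # NaN and infinity fail this comparison and are rejected as well.
    if not (MIN_MULTIPLIER <= value <= MAX_MULTIPLIER):
        return None
    return value
```

With this approach “5.75” is accepted as typed, while “50.0” or “abc” is rejected instead of being silently stored.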

WHAT: App/Web/CLI features are inconsistent
HOW: Feature parity between management methods is inconsistent. Actively managing a node requires two or more methods of access.
IMPACT: Irritating. It makes managing a mesh tiresome, and troubleshooting a node in the field can require multiple devices or accessing the web UI as well as using the app.
COMMENT: Many features are inconsistent across all four major platforms while developers continue to add “shiny” stuff. Consistency is important if this is to be considered a viable platform for the use cases it claims to be for.

WHAT: Bluetooth is disabled while WiFi is in use
HOW: Enable WiFi connectivity and attempt to access node via Bluetooth. No connection will be established.
IMPACT: Irritant. This may be a hardware limitation BUT it doesn’t seem to get a mention anywhere and is a frequent query. It’s related to the following issue too.
COMMENT: This may be a documentation thing if it is a hardware limitation. WiFi and Bluetooth are known to interfere with each other so it’s not surprising, but maybe a mention is needed.

WHAT: WiFi connectivity issues require a USB/Serial connection to fix and slow the node
HOW: Configure WiFi with the wrong credentials. The node becomes unavailable and remains that way. The UI on the screen is notably slowed. I’ve not been able to verify if the overall system is slowed. You’ll need a Serial connection or to connect via USB to fix it.
IMPACT: Major irritant. As the UI has no scan facility it’s easy to get the connection details wrong. This results in a node that will need reconfiguring via a physical connection. This is especially a pain if the node is installed somewhere and has marginal signal.
COMMENT: A solution may be ‘n’ attempts and then a switch to Bluetooth. A scan type interface would lessen the chance of misconfiguration, along with the ability to provide an alternative network or even go into AP mode.
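The “n attempts then fall back” idea can be sketched in a few lines of Python. This is only an illustration of the logic, not firmware code. `try_wifi` is a hypothetical callable standing in for whatever the firmware does to attempt a connection, and the default of 5 attempts is an arbitrary choice of mine.

```python
from typing import Callable


def choose_link(try_wifi: Callable[[], bool], max_attempts: int = 5) -> str:
    """Try WiFi a bounded number of times, then fall back to Bluetooth.

    try_wifi is a hypothetical callable that returns True once a
    connection has been established and False on a failed attempt.
    """
    for _ in range(max_attempts):
        if try_wifi():
            return "wifi"
    # Every attempt failed, so re-enable Bluetooth rather than leaving
    # the node unreachable until someone turns up with a USB cable.
    return "bluetooth"
```

The point of the bound is that a typo in the credentials costs a few failed attempts rather than a site visit.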

WHAT: No security, not even basic username/pass anywhere
HOW: No authentication is required to access a node
IMPACT: Irritant/Critical security issue. A stolen or lost node with admin channels configured can be used to hijack other nodes on that admin network.
COMMENT: This is basic good practice. Come on! The impact of this has been downplayed when mentioned, however it could allow a whole “fleet” of one person’s nodes, or a communal admin group of nodes, to be hijacked. In the current climate, and again given the idea that these are usable as an off-grid, in-case-of-emergency platform, this is unacceptable.

WHAT: Low battery causes infinite sleep. Node will not wake when power returns.
HOW: Allow the battery to run below the discharge level of about 2.7V. The device will enter deep sleep and stay sleeping. The code defaults to a sleep of 36 years before it wakes!?
IMPACT: Critical. Solar based nodes or nodes with intermittent power will not wake when power returns. The node will become “comatose” until the power source is no longer able to sustain the uC and a complete shutdown occurs. This could take weeks or even months depending on the power source, and if power is restored the device may charge while not waking, making this a completely unresolvable situation. The node will need to be reset or the user button pressed. We have solved this with an external MCU.
COMMENT: Who decided sleeping forever was a good idea? This setting will result in a solar/wind/intermittent power sourced node “bricking” itself. This is a bad, bad default to hide away. The documentation is unclear as to what this setting does, how it is used and what its implications are. This can, and has, resulted in remote nodes becoming dead and needing manual intervention.
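As a rough idea of what I’d expect instead, here is a minimal Python sketch of a bounded sleep with a periodic re-check. It is not how the firmware works, nor a description of our external MCU. The thresholds and the 15 minute interval are made-up values for illustration. The resume voltage is deliberately higher than the cutoff so the node doesn’t flap between awake and asleep while the battery is only just recovering.

```python
from typing import Tuple

# Hypothetical values for illustration; tune for the actual cell and board.
LOW_CUTOFF_V = 3.1         # go to sleep below this
RESUME_V = 3.5             # only resume once charged back above this
RECHECK_SECONDS = 15 * 60  # wake briefly this often to re-measure


def next_state(battery_v: float, asleep: bool) -> Tuple[bool, int]:
    """Return (asleep, sleep_seconds) for the next cycle.

    A sleeping node always wakes again after RECHECK_SECONDS to measure
    the battery, so it can recover by itself once power returns.
    """
    if asleep:
        if battery_v >= RESUME_V:
            return (False, 0)
        return (True, RECHECK_SECONDS)
    if battery_v < LOW_CUTOFF_V:
        return (True, RECHECK_SECONDS)
    return (False, 0)
```

The important property is that no path returns an unbounded sleep. The worst case is another short nap followed by another measurement.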

WHAT: ADC/Battery calibration is/can be critical. Docs barely touch on this
HOW: If ADC calibration is not done, the battery may be exhausted before the node shuts down. This causes a brownout condition (see below) and poses the risk of a serious, destructive battery failure. To reproduce, use a variable power supply in place of a battery and reduce the voltage slowly to below the point at which the node runs.
IMPACT: Critical. Potential safety risk. Some nodes may shut down at the prescribed cutoff points, others will lock up/fail/enter an unknown state. Conversely, a node may enter sleep before the battery is exhausted and go comatose, see above. There is a risk of a battery pack entering deep discharge which can result in catastrophic failure during later charging.
COMMENT: This should not be a simple “you can do this”. If you plan on running on battery you *MUST* do this and verify it’s either correct or the unit shuts down early. An early sleep is irritating but not a safety issue, which running until the battery is dead is. Use of a DW04 based battery safety board is a must, especially with LiPo cells. This is something that devs and integrators (users) need to be aware of.
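For what it’s worth, the calibration arithmetic itself is trivial. The sketch below assumes the voltage the node reports scales linearly with the multiplier, so the correction is just the ratio of a multimeter reading to the reported value. The function name is mine and the numbers in the usage note are invented for the example.

```python
def corrected_multiplier(current_multiplier: float,
                         measured_v: float,
                         reported_v: float) -> float:
    """Scale the multiplier so the reported voltage matches the meter.

    Assumes the reported voltage is proportional to the multiplier.
    measured_v is the multimeter reading at the battery and reported_v
    is what the node shows at the same moment.
    """
    if reported_v <= 0 or measured_v <= 0:
        raise ValueError("voltages must be positive")
    return current_multiplier * measured_v / reported_v
```

For example, with a current multiplier of 2.0, a meter reading of 4.10V and a node reporting 3.90V, the corrected multiplier works out at roughly 2.103. After setting it, repeat the measurement to verify rather than trusting the maths.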

WHAT: No apparent watchdog implementation
HOW: If a node hard locks, it stays locked. Glitch the power rails a few times and the node hard locks. Corrupted flash can also cause a hard lock.
IMPACT: Irritation. The ESP watchdog *SHOULD* catch this condition but it doesn’t always seem to. It may be the case that the board starts “watchdogging” in a loop.
COMMENT: Mechanisms for dealing with this are present in hardware. Firmware *can* catch a watchdog reset and deal with it.
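One way of dealing with a watchdog reset loop, sketched in Python purely to show the logic: count consecutive watchdog resets in something that survives a reset, and drop into a safe mode once a limit is hit. The reset reason strings, the limit of 3 and the idea of a “safe” mode are assumptions of mine, not anything taken from the firmware.

```python
from typing import Tuple

MAX_WATCHDOG_RESETS = 3  # hypothetical limit before giving up on a normal boot


def boot_decision(reset_reason: str, watchdog_count: int) -> Tuple[str, int]:
    """Return (mode, new_count) at boot.

    watchdog_count is the number of consecutive watchdog resets seen so
    far, assumed to be stored somewhere that survives a reset.
    """
    if reset_reason != "watchdog":
        # Any other kind of boot clears the counter.
        return ("normal", 0)
    watchdog_count += 1
    if watchdog_count >= MAX_WATCHDOG_RESETS:
        # Stuck in a loop: boot with safe defaults instead of retrying forever.
        return ("safe", watchdog_count)
    return ("normal", watchdog_count)
```

What “safe” means is up for debate, but even just booting with modules disabled and the radio quiet would be better than looping.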

WHAT: No apparent use of brownout detection
HOW: If the sleep/power monitor function is not working correctly this *will* happen. Otherwise, disable power monitoring and reduce the power rail voltage until the unit either stops responding or starts a reboot loop.
IMPACT: Critical. The node can be rendered partially or completely inoperable even when power is restored. The most common results are loss of the node name, a corrupt node DB giving symptoms similar to a failed front end, being unable to configure the power module, and other issues up to an unresettable node throwing ESP exceptions.
COMMENT: Basic embedded design. The ESP chips provide a brownout detection mechanism but this does not seem to be used. Flash read/write should not be happening during a brownout condition and it’s the job of the brownout detector to stop this from happening. Relying on ADC sensing alone is not good enough.

WHAT: No apparent detection of corrupt flash
HOW: See above (Brownout detection)
IMPACT: Critical. A node can be misconfigured or left in an indeterminate state, potentially with critical settings (ADC ratio, band, TX settings) incorrect. Nodes can sometimes behave in unpredictable ways. The ONLY fix is a reflash with erase.
COMMENT: Input validation of data read from EEPROM/Flash/SPIFFS/etc. needs to happen, with either fields or configuration blocks being checksummed or CRC’ed. Fail gracefully to safe defaults and trash the node DB if corruption happens.
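To show what I mean by a CRC’ed configuration block, here is a minimal Python sketch. The three-field layout, the field names and the “safe defaults” are all invented for illustration and bear no relation to how Meshtastic actually stores its config. The principle is what matters: the block carries a CRC32, and anything that fails the check is thrown away in favour of known-safe values.

```python
import struct
import zlib
from typing import Tuple

# Hypothetical layout: ADC multiplier (float32), region code (uint8),
# TX power in dBm (int8). Not the real Meshtastic format.
_FORMAT = "<fBb"
_PAYLOAD_SIZE = struct.calcsize(_FORMAT)
SAFE_DEFAULTS: Tuple[float, int, int] = (0.0, 0, 0)  # made-up "unset" values


def save_config(adc_multiplier: float, region: int, tx_power: int) -> bytes:
    """Pack the settings and append a CRC32 of the packed bytes."""
    payload = struct.pack(_FORMAT, adc_multiplier, region, tx_power)
    return payload + struct.pack("<I", zlib.crc32(payload))


def load_config(blob: bytes) -> Tuple[float, int, int]:
    """Return the stored settings, or SAFE_DEFAULTS if the block is damaged."""
    if len(blob) != _PAYLOAD_SIZE + 4:
        return SAFE_DEFAULTS
    payload, stored = blob[:_PAYLOAD_SIZE], blob[_PAYLOAD_SIZE:]
    if struct.unpack("<I", stored)[0] != zlib.crc32(payload):
        return SAFE_DEFAULTS
    return struct.unpack(_FORMAT, payload)
```

A truncated block or one with a flipped byte comes back as the defaults rather than as a plausible looking but wrong ADC ratio or TX setting.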



Meshtastic. The first month…

15th May 2024 richard

Well, it’s been a month now that I’ve been working with Meshtastic and I think it’s time I went over what I have found.

Online, via YouTube, Meshtastic is being promoted as the panacea of offline, decentralised communications, however the reality isn’t quite so clear. On paper it’s an awesome system and if you follow the videos posted by many, it’s the solution to everything.

In use, when things are working as advertised, it’s a good system. It definitely has applications and I can see a number of use cases for it. It’s fairly easy to get up and running without really knowing anything about it and you can have a node running in minutes. On the surface it seems to work, and work reasonably OK, but start digging deeper and things aren’t so great.

Firstly, in the UK it’s crippled. This isn’t just Meshtastic but LoRa in general. There are two utterly incompatible bands available to you, 868MHz and 433MHz. You can use either licence free but these are the ONLY bands in the UK, just two, for a long-range licence free application. LoRa is painfully slow so congestion becomes an issue really fast. On top of this there are limits on how long you can transmit for too. To make it worse, licensed users are on the same bands and are allowed to use higher ERPs. Then lastly, these bands aren’t JUST for LoRa. There are other things on them, especially 433 which is close to the business UHF bands and used heavily by keyfobs etc. If you get interference you can’t work round, you are screwed, there is no fix. Again, not a Meshtastic thing but you need to be aware of it.

MQTT is the next bugbear. Despite all the tutorials saying not to use it on public channels, people do. The result is chaos and an unusable mesh. There have been steps to reduce this issue, with options to control rebroadcasting, and some defaults have been changed. However MQTT nodes do pop up and they can flatten a large mesh almost instantly. Don’t get me wrong, in the RIGHT situation it’s a really neat feature for gatewaying private channels or joining two disparate meshes, but on general public channels it’s a complete nightmare.

This brings us on to the biggest bugbear: it’s so unbelievably fragile. MQTT is just one of many ways it can die. I realise this is an open source/free project but there is a clear feeling that we are following the “move fast and break things” mantra and that co-ordination between the different teams isn’t great. The firmwares frequently cause some odd issues and often the updates are adding features and not necessarily helping stability. There are long term bugs that have entries on the bug tracker that aren’t getting fixed or that get regressed. Setting the ADC ratio for power monitoring is an excellent example. It’s been broken for months and had “apparently” been fixed at some point. Small changes can have far reaching, undocumented effects, which leaves users wondering why changing X broke Y. This has now resulted in me, personally, sticking to firmwares with a good few weeks under their belt. Remember, beta is supposed to be the last step before release. Given there IS no release, maybe there should be a feature hold so we get a release firmware?

There are issues with the firmware that stem from what seems to be a lack of understanding of the embedded environment. A bad battery can corrupt the flash when the uCs in use have mechanisms to stop this. In the case of the LoRa32 there is a battery monitor to try and stop this condition but the ADC is inaccurate, and just how important this is to get right isn’t mentioned anywhere. *IF* you get it tuned you’ll have a node that then goes to sleep with no ability to wake when power returns or, as I suspect, the default wake is set to a time period measured in millennia. Get it wrong and you are stuffed. These issues paired with a solar power supply can leave nodes in an unknown state, partially functional or just dead. One of our solar nodes, which now has a separate supervisor, is up a 25m mast, behind three locked gates and an alarm system that requires telephone authorisation to gain access. This isn’t a well received bug here.

WiFi is awesome and useful but it disables Bluetooth. This is a device limitation, I suspect, but it means that if you pick up your node and go somewhere else OR get your SSID details wrong there is no recovery from this. As WiFi details are manually input there is good scope for hashing this up. You’re then going to have to return to your WiFi connection, if you went out of range, and disable WiFi or, in the case of wrong credentials, plug in a USB cable. This is just an example of an ill thought out process that’s just not user friendly at all. There are ways of dealing with this and it may be device specific, but it *should* have been caught.

The serial module doesn’t behave quite the way it should, and can die for no reason, returning after some time. This caused hours of lost time setting up our external supervisor. There is no rhyme or reason behind this. There is also a tendency for the module to send a random char when it starts. Combine this with the module boot looping because of the lack of brownout detection and you get a node that can spam the mesh just because its battery is low. This is all down to a lack of testing for, or understanding of, edge cases in an embedded environment. The same issue has shown up inside our mesh where firmware doesn’t apply correctly and leaves nodes in an unrecoverable state, needing experimentation or borderline voodoo incantations to get the node back. The power issue already mentioned above can cause a weird situation that looks like the front end has failed.

On top of this there is so much inconsistency in the management of the thing. The Python CLI can do things the Android app can’t do, which can do things that the Apple app can’t, and then there’s the web client…. Data and parameters are presented inconsistently across the various platforms. For example, configuring our nodes:
Meshtastic web is the easiest way to get the basics set up. But we can’t set up the admin bits there, so that needs the Python CLI. I’m fine with that, it’s actually not a bad solution and keeps the admin bits out of the way. But then to set up the ADC multiplier I have to use the Android app because the input for this in Meshtastic web has been broken for months. I can only traceroute and get a true idea of signal quality from the Android app, and so on.

On the subject of inconsistency, why should two identically configured nodes in the same location see different messages? Why can one send but not the other? Why do they behave differently on the mesh and, worse, why do the ways they differ change constantly?

The key to reliable, resilient and useful comms is consistency. A long range but inconsistent system is worse than a short range one that consistently works. I can send text messages and do a lot of what Meshtastic does with our UHF radios, just not the mesh part, but a repeater is an easy task to deal with. Two of our nodes share mast space with our UHF repeaters.

Developing for Meshtastic… yeah, good luck with that.
Meshtastic is built around Protobufs. We are told that’s how it’s done and that’s about it. Want the specs for how to talk to it? You are going digging in the code. Want to know how to connect? Well, that’s not really documented either. Want to use something that can’t use the Protobufs code? You are SOL, there is no help from that quarter. In fact if you don’t know all about Protobufs you aren’t going to be doing anything with the Meshtastic code. I realise it’s probably a good solution but it’s third party code that’s tied to specific languages AND is the plaything of a major corporation who have demonstrated time and time again that they will take their toys away no matter how many people it causes grief for. It’s also horribly overcomplicated for this application. Attempts I have made to gain access to protocol information to code my way out of this rabbit hole failed because of the insanely poor documentation and the devs’ unwillingness to help those that won’t follow the “one true way”. Sadly the documentation issues persist into the code base, which is poorly commented and insanely confusing. Maybe this is the reason for the inconsistencies?

Meshtastic has a huge amount of potential. As it stands it is fragile and unreliable, and the software base isn’t consistent or stable. I had hoped to contribute towards the project and add modules and addons but at this stage I/we will just roll our own solution as Meshtastic isn’t ready or, as it seems, doesn’t WANT to be ready.

Richards Place
