How to debug character errors in Python

Question

Level 6

15,025 points

I lost some hair when correcting a simple Python 3.12.3 script which hangs.

Then, by copying a similar working script line by line and finally character by character, I noticed that the problematic script had a single Cyrillic 'а' character where it should have had a Latin 'a' character:

а

CYRILLIC SMALL LETTER A

Unicode: U+0430, UTF-8: D0 B0

a

LATIN SMALL LETTER A

Unicode: U+0061, UTF-8: 61

So all I had to do was to fix that character in this line:

a = c

I am too lazy to type, so I had copied that script from an .svg file by taking a screenshot and running macOS 14 OCR on it. Yes, there were many obvious errors, which I corrected. Some wrong 'c' and 'x' characters could be spotted by close inspection. I used BBEdit to highlight the same character patterns, and I copied the script to LibreOffice Writer and used very large fonts, but I still missed that error in 'a'.

→ Is there some better debugging method for this?

IDLE does report errors in some lines but this script just hangs. 'python -m trace --trace YOURSCRIPT.py' did provide some clues to the offending lines. I am still a Python newbie so be gentle.

Mac mini

Posted on May 18, 2024 2:16 AM

Answer 1

May 18, 2024 8:28 AM in response to Matti Haveri

Okay, this question covers a range of topics.

First, when posting code-involved questions, it helps greatly to post code; to post a concise reproducer. This is what started all this off, and you didn’t post the hunk of code that stalled things. Yes, that can involve binary search within the code, with print-based or other debugging, too. But code and errors matter.

Here is how Python deals with different file encodings.

https://docs.python.org/3/howto/unicode.html

This is an overview of Unicode, which most of us will inevitably need to know more about than what we would prefer to know:

https://tonsky.me/blog/unicode/

There can be some great “fun” awaiting in a UTF-8 or other Unicode file. An OCR tool usually won’t give you a gremlin character (an invisible of some sort) for instance, but other tools will:

https://www.thelinuxrain.org/articles/hunting-gremlin-characters

I’m presuming use of Apple Live Text as your OCR here, so Cyrillic seems an odd choice if English was otherwise being detected, but this also wouldn’t be the first time that some wad of ML hallucinated. Apple doesn’t have controls over this detection, either. OCR ~always leaves some rubbish.

BBEdit can show the character encoding for the file in the status bar, when that is enabled. I’d be shocked if it couldn’t search for non ISO Latin-1 characters, too.

I would be exceedingly cautious around getting Office or LibreOffice or other such in the mix when programming, as you’ll need to have that output plain text files only, and not any of the Office formats. BBEdit, vim, emacs, pico, etc., all work with plain text files.

I usually use xxd when looking for “fun” characters in files, and that tool is reversible for data files; you can hex dump a data file (or a text file) and then edit the hex dump, and then xxd to convert the patched file back to the original file format. The hexdump or other tools can also be used.
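Since the script in question is Python anyway, the same hunt can be sketched in Python itself. This is only a minimal illustration, not a replacement for xxd: the function name find_non_ascii and the sample text are made up for this example, and the Cyrillic letter is written as an escape so it cannot be mistaken for the Latin one.

```python
import unicodedata

def find_non_ascii(text):
    """Return (line, column, char, code point, name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                # Fall back to a placeholder for characters without a Unicode name.
                name = unicodedata.name(ch, "<unnamed>")
                hits.append((line_no, col_no, ch, f"U+{ord(ch):04X}", name))
    return hits

# Hypothetical sample: line 1 starts with CYRILLIC SMALL LETTER A (U+0430).
sample = "\u0430 = c\nb = a + 1\n"
for hit in find_non_ascii(sample):
    print(hit)
```

For a real file, the text could be read with open(path, encoding="utf-8").read() and passed to the same function.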

BBEdit can do various conversions and searches as well:

https://apple.stackexchange.com/questions/408181/changing-character-encoding-from-unicode-to-ascii

I’d be surprised if BBEdit couldn’t somehow highlight ranges too, but I don’t use that editor heavily.

Using file (as etresoft mentioned) and grep for file format spelunking:

https://unix.stackexchange.com/a/474812

The following converts a UTF-8 file into an ISO Latin-1 (~ASCII) file, though lots of UTF-8 isn’t present in ISO Latin-1 and will get vaporized. Do not specify the same file name for input and for output. (Long switches shown, and -t and -f will work, too.)

iconv --from-code=utf-8 --to-code=iso-8859-1 < utf8.txt > ascii.txt
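The same kind of conversion can be tried from Python, which also shows where the first unconvertible character sits. This is a small sketch under assumptions: the one-line sample stands in for the real file, and the Cyrillic letter is written as an escape.

```python
# Hypothetical sample line; "\u0430" is CYRILLIC SMALL LETTER A.
source_line = "\u0430 = c"

# A strict ASCII encode fails and reports the position of the first offender.
try:
    source_line.encode("ascii")
except UnicodeEncodeError as err:
    print(f"non-ASCII character {err.object[err.start]!r} at index {err.start}")

# A lossy conversion to ISO Latin-1 replaces what cannot be represented with "?".
lossy = source_line.encode("iso-8859-1", errors="replace")
print(lossy)
```

The strict variant is the more useful one for debugging, because the exception points at the character instead of silently dropping it.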

macOS and common Apple tools deal reasonably well with UTF-8 in most spots including in the command shell, but other UNIX platforms can have issues.

TL;DR: xxd and look

Answer 2

etresoft

Level 8

47,436 points

May 18, 2024 5:02 AM in response to Matti Haveri

You can run "file" on your file. If it shows:

file.txt: Unicode text, UTF-8 text

then you've got some funky characters in there. It should say ASCII text.

However, in modern languages like Swift, there's nothing wrong with Cyrillic variable names, or even emoji.

Answer 3

May 18, 2024 11:19 AM in response to Matti Haveri

Looking at that Python, I’d expect it never gets out of the while loop. That’s the hang. Stepping through the code or instrumenting the code would have shown where the hang lurked, too.

This case is an example of why explicit declarations are popular with some developers, and variations of this are among the ways to run into trouble with any dynamic language, Python or otherwise. (The recently-added Python type aliases are interesting here, though not enough.) Various BASIC dialects have explicit declaration keywords to override the usual language default laissez-faire variable declaration system, too.

https://docs.python.org/3/library/typing.html

https://typing.readthedocs.io/en/latest/spec/type-system.html

Swift would have thrown an error with the (lack of a) declaration, as would some other language choices.

Answer 4

Matti Haveri Author

Level 6

15,025 points

May 18, 2024 10:36 AM in response to MrHoffman

Thanks for the tips.

Attached is the unfixed script that hangs (I left it localized here). It can be fixed by changing that one letter. Basically, the function 'f(x)=sqrt(x)-3x+x^2-4' has one zero point. The script asks the user for positive values a and b as long as 'f(a)·f(b)<0', and then it calculates the zero point in ]a,b[ to five-decimal accuracy. Try entering 0 and 7 for a and b, for example, and it hangs.

Python tutor script
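The attached script is not reproduced in this thread, so the following is only a sketch of what such a bisection might look like, not the original code. The names bisect, tol and max_iter are assumptions made for illustration. The iteration cap is the point: a loop whose interval never shrinks raises an error with the current values instead of hanging.

```python
import math

def f(x):
    return math.sqrt(x) - 3 * x + x ** 2 - 4

def bisect(a, b, tol=1e-5, max_iter=200):
    """Bisection on [a, b] that raises instead of looping forever."""
    if f(a) * f(b) >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        if b - a <= tol:
            return (a + b) / 2
        mid = (a + b) / 2
        if f(a) * f(mid) <= 0:
            b = mid
        else:
            # If this assignment went to a look-alike name instead,
            # the Latin a would never move and the interval would stop shrinking.
            a = mid
    raise RuntimeError(f"no convergence after {max_iter} iterations: a={a}, b={b}")

root = bisect(0, 7)
print(round(root, 5))
```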

Answer 5

Matti Haveri Author

Level 6

15,025 points

May 18, 2024 8:05 AM in response to etresoft

Thanks. I have now noticed that BBEdit can pinpoint non-ASCII characters: convert the file to 'Western (ASCII)', try to save it, and choose 'Show Unmappable Characters'. My test file had numerous ä and · characters in comments, while a single Cyrillic а in the actual code caused the script to hang.
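When comments legitimately contain characters such as ä and ·, it may help to check only the identifiers. The following is a minimal sketch using Python's standard tokenize module; the function name non_ascii_names and the sample source are hypothetical, and the non-ASCII characters are written as escapes.

```python
import io
import tokenize

def non_ascii_names(source):
    """Return (line, column, name) for identifiers containing non-ASCII characters."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens cover identifiers and keywords; comments and strings are skipped.
        if tok.type == tokenize.NAME and not tok.string.isascii():
            row, col = tok.start
            hits.append((row, col + 1, tok.string))
    return hits

# Hypothetical sample: a comment with non-ASCII text, then a Cyrillic identifier.
sample = "# comment with \u00e4 and \u00b7\n\u0430 = c\nb = a\n"
print(non_ascii_names(sample))
```

Only the identifier on line 2 is reported here; the characters in the comment are ignored.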

Using Cyrillic а in the code does work as long as there is no Latin a in the mix, because Python treats the two as different names.
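A small sketch of why mixing the two goes wrong: Python sees them as two separate variables. The source is passed to exec as a string with an escape only so that the two letters can be told apart on screen.

```python
namespace = {}
# "\u0430" is CYRILLIC SMALL LETTER A; plain "a" is LATIN SMALL LETTER A.
exec("a = 1\n\u0430 = 2\n", namespace)

latin_value = namespace["a"]
cyrillic_value = namespace["\u0430"]
print(latin_value, cyrillic_value)
```

Assigning to one of them leaves the other untouched, which is how a loop that tests the Latin a can keep seeing a stale value.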
