Index of strings

Regular Expressions and counting lengths 2025-11-27

I was messing around with regular expressions on some default interpreters on my Debian machine, wondering about what the default encoding behaviour for string literals might be. As you do. So, given a string with a "pesky foreign accent" in it, how many characters do various languages think it has?

Unfortunately, this creaky old blog software I hand-cranked cannot render this many markdown source blocks (ROFL), so I collated my findings in a GitHub gist. Despite all its many faults, GitHub excels at markdown-serving.

The languages I tested are Perl, Python, PHP, Ruby, Bash, Common Lisp, and JavaScript.

The test? Does the string café (an English loanword with an accented character) match the regular expression denoting 'a four-character string'?

On my computer, with a UTF-8 locale, this string, while clearly four characters long, occupies five bytes. So does the string match four characters or five characters? This is computers, so obviously the answer is "it depends".
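To make the characters-versus-bytes distinction concrete, here is a minimal sketch in Python 3. It is not lifted from the gist; the variable names are mine, and it assumes the é is the single precomposed code point U+00E9 rather than an e followed by a combining accent. It runs the same "exactly four of anything" pattern against the text string and against its UTF-8 encoding.

```python
import re

# "café" written with an escape so the result does not depend on the
# encoding of this source file. Assumption: precomposed é (U+00E9).
s = "caf\u00e9"
encoded = s.encode("utf-8")  # the same text as raw UTF-8 bytes

# "Exactly four of anything", once as a text pattern, once as a bytes pattern.
four_chars = re.compile(r"^.{4}$")
four_bytes = re.compile(rb"^.{4}$")

print(len(s))                            # number of code points
print(len(encoded))                      # number of bytes
print(bool(four_chars.match(s)))         # does the text match four characters?
print(bool(four_bytes.match(encoded)))   # do the bytes match four bytes?
```

Whether an interpreter behaves like the text case or the bytes case by default is exactly the "it depends" part: it comes down to whether the string literal is treated as decoded text or as a plain sequence of bytes.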

Maybe you'd like to guess before you go look at the answers?

posted by cms on 2025-11-27

beatworm.co.uk