"A homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar." (wikipedia:Homoglyph)
Unicode homoglyphs can be a nuisance on the web. Your most popular client, AlaskaJazz, might be upset to be impersonated by a trickster who deliberately chose the username ΑlaskaJazz.
AlaskaJazz is single script: only Latin characters.
ΑlaskaJazz is mixed-script: the first character is a Greek letter.
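To make the single-script versus mixed-script distinction concrete, here is a minimal sketch that uses only the standard library. It is not this library's implementation: it assumes that the first word of a character's Unicode name (for example LATIN or GREEK) is a good enough stand-in for its script, which is only a rough heuristic. The helper names `rough_scripts` and `looks_mixed_script` are hypothetical.

```python
import unicodedata


def rough_scripts(text):
    """Return the set of script-like labels found among the letters of text."""
    labels = set()
    for ch in text:
        if ch.isalpha():
            # Assumption: the first word of the Unicode name approximates the
            # script, e.g. "LATIN SMALL LETTER A", "GREEK CAPITAL LETTER ALPHA".
            labels.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return labels


def looks_mixed_script(text):
    """True when the letters of text carry more than one script-like label."""
    return len(rough_scripts(text)) > 1


print(looks_mixed_script("AlaskaJazz"))        # all Latin letters
print(looks_mixed_script("\u0391laskaJazz"))   # starts with GREEK CAPITAL LETTER ALPHA
```

Digits and punctuation are skipped on purpose, since they are shared between scripts and would otherwise make almost every string look mixed.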
You might also want to avoid people being tricked into entering their password on www.faϲebook.com instead of www.facebook.com. Here is a utility to play with these confusable homoglyphs.
Not all mixed-script strings have to be ruled out, though. You could exclude only those mixed-script strings that contain characters which might be confused with a character from Unicode blocks of your choosing.
ρττ is fine: single script.
AlloΓ is fine when our preferred script alias is ‘latin’: mixed script, but Γ is not confusable.
Alloρ is dangerous: mixed script, and ρ could be confused with the Latin letter p.
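The rule illustrated by these three strings can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `SAMPLE_CONFUSABLES` is a tiny hand-picked table rather than the Unicode confusables data, scripts are approximated by the first word of each character's Unicode name, and `script_of` and `looks_dangerous` are hypothetical names.

```python
import unicodedata

# Hand-picked sample of look-alike pairs, for illustration only.
# The real data set is far larger.
SAMPLE_CONFUSABLES = {
    "\u03c1": "p",  # GREEK SMALL LETTER RHO
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u03f2": "c",  # GREEK LUNATE SIGMA SYMBOL
}


def script_of(ch):
    # Assumption: the first word of the Unicode name approximates the script.
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def looks_dangerous(text, preferred="LATIN"):
    """True if text is mixed-script and contains a character from another
    script that the sample table maps to a character of the preferred script."""
    letters = [ch for ch in text if ch.isalpha()]
    if len({script_of(ch) for ch in letters}) < 2:
        return False  # single script: nothing to flag
    for ch in letters:
        target = SAMPLE_CONFUSABLES.get(ch)
        if (
            target is not None
            and script_of(ch) != preferred
            and script_of(target) == preferred
        ):
            return True
    return False


for candidate in ("\u03c1\u03c4\u03c4", "Allo\u0393", "Allo\u03c1"):
    print(candidate, looks_dangerous(candidate))
```

The single-script check comes first, so a string written entirely in Greek is never flagged even though it contains ρ.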
This library is compatible with Python 2 and Python 3.
Is the data up to date?
The Unicode block aliases and names for each character are extracted from this file provided by the Unicode Consortium.
The matrix of which characters can be confused with which other characters is built using this file provided by the Unicode Consortium.
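As a rough idea of what building such a matrix involves, the sketch below parses a couple of sample lines written in the general `source ; target ; type # comment` layout used by the Unicode confusables data. It is not the library's parser: the two sample lines are included inline for illustration, and `parse_confusables` is a hypothetical name.

```python
# Illustrative lines in the "source ; target ; type # comment" layout.
SAMPLE = """\
# comment lines and blank lines are ignored

03C1 ; 0070 ; MA # GREEK SMALL LETTER RHO -> LATIN SMALL LETTER P
0391 ; 0041 ; MA # GREEK CAPITAL LETTER ALPHA -> LATIN CAPITAL LETTER A
"""


def parse_confusables(text):
    """Map each source string to the list of strings it can be confused with."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        source, target, _kind = [field.strip() for field in line.split(";")]
        # Each field is a space-separated sequence of hexadecimal code points.
        src = "".join(chr(int(cp, 16)) for cp in source.split())
        dst = "".join(chr(int(cp, 16)) for cp in target.split())
        table.setdefault(src, []).append(dst)
    return table


print(parse_confusables(SAMPLE))
```

A dictionary of lists is used here because one source character may map to a multi-character target, and a fuller table would typically record the relation in both directions.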
This data is stored in two JSON files:
confusables.json. If you delete them, they will both be recreated by
downloading and parsing the two files mentioned above, then stored as JSON