Dealing with Non-ASCII Characters

July 13, 2020

Problem

Quick: why is this JSON not valid?

{
  “user”: {
    “username”: “jpalardy”,
    “first_name”: “Jonathan”,
    “last_name”: “Palardy”
  }
}


Curly quotes!

Trick question? Yes and no... this happened to me and it was difficult to troubleshoot visually.
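To see the failure without squinting, here is a minimal sketch using Python's standard `json` module. The helper name `is_valid_json` and the shortened sample strings are made up for this illustration; only the quote characters differ between the two inputs.

```python
import json

# A shortened version of the JSON above, with curly quotes (U+201C / U+201D).
curly = '{ “user”: { “username”: “jpalardy” } }'
# The same document with straight ASCII quotes (U+0022).
straight = '{ "user": { "username": "jpalardy" } }'


def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(curly))     # False
print(is_valid_json(straight))  # True
```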

A Class of Problems…

Many text formats, programming languages and other machine-parsed texts have rules about which characters are allowed.

When in doubt, the lowest common denominator is usually ASCII:

The decimal set:

  0 nul    1 soh    2 stx    3 etx    4 eot    5 enq    6 ack    7 bel
  8 bs     9 ht    10 nl    11 vt    12 np    13 cr    14 so    15 si
 16 dle   17 dc1   18 dc2   19 dc3   20 dc4   21 nak   22 syn   23 etb
 24 can   25 em    26 sub   27 esc   28 fs    29 gs    30 rs    31 us
 32 sp    33  !    34  "    35  #    36  $    37  %    38  &    39  '
 40  (    41  )    42  *    43  +    44  ,    45  -    46  .    47  /
 48  0    49  1    50  2    51  3    52  4    53  5    54  6    55  7
 56  8    57  9    58  :    59  ;    60  <    61  =    62  >    63  ?
 64  @    65  A    66  B    67  C    68  D    69  E    70  F    71  G
 72  H    73  I    74  J    75  K    76  L    77  M    78  N    79  O
 80  P    81  Q    82  R    83  S    84  T    85  U    86  V    87  W
 88  X    89  Y    90  Z    91  [    92  \    93  ]    94  ^    95  _
 96  `    97  a    98  b    99  c   100  d   101  e   102  f   103  g
104  h   105  i   106  j   107  k   108  l   109  m   110  n   111  o
112  p   113  q   114  r   115  s   116  t   117  u   118  v   119  w
120  x   121  y   122  z   123  {   124  |   125  }   126  ~   127 del

(courtesy of `man ascii`, a reference never too far)

And while “curly quotes” might seem like a made-up problem¹, there are other insidious examples:

en dash, em dash
non-breaking space and tab (to a lesser extent)
carriage return and newline
in general: homoglyphs (other examples)
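These look-alikes are easier to tell apart by code point than by eye. The sketch below prints the Unicode name of a few pairs using Python's standard `unicodedata` module. The specific pairs in `LOOKALIKES` and the helper `describe` are picked for this illustration and are not an exhaustive list.

```python
import unicodedata

# (ASCII character, look-alike) pairs, chosen here for illustration only.
LOOKALIKES = [
    ('"', "\u201c"),  # straight quote vs. left curly quote
    ("-", "\u2013"),  # hyphen-minus vs. en dash
    ("-", "\u2014"),  # hyphen-minus vs. em dash
    (" ", "\u00a0"),  # space vs. non-breaking space
    ("A", "\u0391"),  # Latin A vs. Greek capital alpha (a homoglyph)
]


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME'; control characters have no name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


for ascii_ch, other in LOOKALIKES:
    print(f"{describe(ascii_ch):<32} vs  {describe(other)}")
```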

Solutions

There is no general solution to all the problems, only an assortment of tricks:

“weird spacing” is often flagged or fixed by text editors; details will vary
file formats (line endings): can be fixed with dos2unix or similar
external linters can be your sanity check:

> jq . invalid.json
parse error: Invalid numeric literal at line 2, column 13
>
# better than nothing? 🤔
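For the line-ending case, the core of the fix is small enough to sketch. This is only an approximation of the CRLF-to-LF conversion, not a replacement for dos2unix; the function name `to_unix_newlines` is made up for this example, and it deliberately leaves a lone carriage return untouched.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows line endings (CR LF) to Unix line endings (LF)."""
    # Works on bytes so that no text decoding is involved.
    return data.replace(b"\r\n", b"\n")


windows_text = b"first line\r\nsecond line\r\n"
print(to_unix_newlines(windows_text))  # b'first line\nsecond line\n'
```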

The Non-Visible ASCII regexp Trick

If what’s allowed is “visible ASCII”, then what’s not allowed is everything else (“non-visible ASCII”): control characters and anything outside ASCII.

[^ -~]

described in words: all characters not between “space” and “tilde”
(I don’t remember where I picked up this trick. I would appreciate a link if you know.)

Why does this work? Referring back to the ASCII table from above:

  0 nul    1 soh    2 stx    3 etx    4 eot    5 enq    6 ack    7 bel
  8 bs     9 ht    10 nl    11 vt    12 np    13 cr    14 so    15 si
 16 dle   17 dc1   18 dc2   19 dc3   20 dc4   21 nak   22 syn   23 etb
 24 can   25 em    26 sub   27 esc   28 fs    29 gs    30 rs    31 us
     /--- start here
 32 sp    33  !    34  "    35  #    36  $    37  %    38  &    39  '
 40  (    41  )    42  *    43  +    44  ,    45  -    46  .    47  /
 48  0    49  1    50  2    51  3    52  4    53  5    54  6    55  7
 56  8    57  9    58  :    59  ;    60  <    61  =    62  >    63  ?
 64  @    65  A    66  B    67  C    68  D    69  E    70  F    71  G
 72  H    73  I    74  J    75  K    76  L    77  M    78  N    79  O
 80  P    81  Q    82  R    83  S    84  T    85  U    86  V    87  W
 88  X    89  Y    90  Z    91  [    92  \    93  ]    94  ^    95  _
 96  `    97  a    98  b    99  c   100  d   101  e   102  f   103  g
104  h   105  i   106  j   107  k   108  l   109  m   110  n   111  o
112  p   113  q   114  r   115  s   116  t   117  u   118  v   119  w
120  x   121  y   122  z   123  {   124  |   125  }   126  ~   127 del
                                              stop here ---/

What is before space? various non-visible characters…
What is after tilde? del, but also ALL other Unicode characters!
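The same pattern can be tried in Python's `re` module, as a quick sketch. The constant name `NOT_VISIBLE_ASCII` and the sample string are made up for this example. Note that tab and newline sit before space in the table, so the pattern flags them too.

```python
import re

# Anything NOT in the range from space (decimal 32) to tilde (decimal 126).
NOT_VISIBLE_ASCII = re.compile(r"[^ -~]")

print(ord(" "), ord("~"))  # 32 126

# Curly quotes around the key, straight quotes around the value.
sample = '“username”: "jpalardy"'
print(NOT_VISIBLE_ASCII.findall(sample))  # ['“', '”']

# Control characters such as tab are outside the range as well.
print(bool(NOT_VISIBLE_ASCII.search("a\tb")))  # True
```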

Why is this useful? Many text editors can highlight based on regular expressions:

[screenshot: curly quotes highlighted in vim]
(this is vim; use :set hlsearch to turn this on)

This trick works everywhere regular expressions work:

[screenshot: curly quotes highlighted in grep]
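Where highlighting is not available, a small script can report the same matches as text. The sketch below is one possible approach, not a standard tool: the function `find_suspects` and the sample document are made up for this example. It splits on "\n" only, so a stray carriage return at the end of a line is reported rather than silently swallowed.

```python
import re
import unicodedata

# Same pattern as above: anything outside the space-to-tilde range.
NOT_VISIBLE_ASCII = re.compile(r"[^ -~]")


def find_suspects(text: str):
    """Yield (line, column, code point, name) per flagged character, 1-based."""
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in NOT_VISIBLE_ASCII.finditer(line):
            ch = match.group()
            name = unicodedata.name(ch, "<unnamed>")
            yield line_no, match.start() + 1, f"U+{ord(ch):04X}", name


doc = '{\n  “user”: 1\n}\n'
for hit in find_suspects(doc):
    print(hit)
```

Unlike a parser's error message, this points at every offending character and names it, which is usually enough to tell a curly quote from a non-breaking space.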

Footnotes:

¹ copy-and-paste from Google Docs, Slack … and let’s compare notes 😐
