4617: Diff fragments do not load correctly when the fragment contains non-ASCII characters

jhominal
chipx86
chipx86
4619, 4620

What version are you running?

3.0.0

What's the URL of the page containing the problem?

/r/{review-request-id}/

What steps will reproduce the problem?

  1. File a review request with a new file with at least two lines containing non-ASCII characters
  2. Create a review with two diff comments, one for each line containing non-ASCII characters
  3. Publish the review

What is the expected output? What do you see instead?

Expected: The page works as expected.
Actual output: One of the comments does not load at all, and the other comment's html contains some extra characters (that match the starting digits of the ID of the non-loading comment)

What operating system are you using? What browser?

Windows, on Chrome and Firefox.

Please provide any additional information below.

The root cause of this bug is a mismatch in the reported HTML length on the route /r/{review_request_id}/_fragments/diff-comments/{diff_comment_id}/ - that is, by looking at the source code, the written length for the HTML is a UTF-8 byte count, but the Javascript code uses that length as a unicode character count - these two counts do not match when non-ASCII characters are present.

I have been able to patch on my instance that issue by replacing line 1429 with:

payload.write(b'%d\n' % len(html.decode('utf-8')))

That is, I do an additional decode in order to get the real unicode character length.

chipx86
#1 chipx86

Thanks for the report and the patched code. I'll get a repro case going and get this into the next release.

  • -New
    +Confirmed
  • +Release-3.0.x
  • +Component:DiffViewer
  • +chipx86
chipx86
#2 chipx86

Fixed on release-3.0.x (4f73586b29857839648043988b7193645e76d12f)

  • -Confirmed
    +Fixed
#3 jhominal

Thank you very much!

However, there is an issue in the report that I have thought of only now, which is that I remember now that Javascript does not index characters according to unicode code points (which Python does for unicode objects), but according to 16-byte units (meaning that non-BMP code points are represented with a pair of surrogates, and have a length of 2 in the Javascript character count).

I will check that today, and file another bug if my suspicions are confirmed.

chipx86
#4 chipx86

You're right, this is a problem. I'd like to reuse this bug for any further discussion on this, since the core problem is not truly fixed. Reopening.

  • -Fixed
    +Confirmed
chipx86
#6 chipx86

So the plan was to put in the character counts instead of byte lengths as a temporary measure just for this release, so that we could ship tomorrow. However, given that it didn't really solve the problem, and given that the holidays are coming up, we're going to delay the release to the week after Christmas and do this right.

The proper solution, which I've been working on, is to retain byte strings and move to binary parsing on the client side. We'll be loading all data into an ArrayBuffer, parsing that out, grabbing byte lengths, and then converting the chunks of content into JavaScript strings via JavaScript's Blob and FileReader. I have this working locally, and reliably, but it still needs additional love and testing.

#8 jhominal

Hello,

I believe that this bug is fixed on 3.0.2 (I have no issues on 3.0.2, except for #4627, which was patched, and you already fixed it ;))

Thank you!

david
#9 david
  • -Confirmed
    +Fixed