New blog import process mangles blog content

  • Looking at my list of blog posts -- https://www.sqlservercentral.com/author/solomon-rutzky --  it's clear that there is a difference between the most recent post (on 2019-03-25) and the previous ones.

     

    The newest post has the correct blog name ("SQL Quantum Leap" instead of "SQLServerCentral Blogs"), and I see that the blog description is finally pulling from the actual description instead of using the first X characters of the post (awesome, thanks!), but the title and content have issues:

     

    1) there is an encoding issue with the title. This happened on the previous site as well. But the "??" at the end of the title ("SSMS Tip #3: Easily Access/Research ALL Unicode Characters (Yes, Including Emojis ??)") should be: 😸 . That character is a supplementary character (an emoji in this case): U+1F638 . In the feed ( https://sqlquantumleap.com/feed/ ), the title is not HTML encoded (WordPress decodes anything HTML encoded in the title), so that character shows up exactly as you see above.

     

    2) In the post content, there are several instances of "??" (and even "?") which should be various non-ASCII characters, some supplementary characters, but some are regular BMP characters, such as "ख़" (ख़)  and "Ѹ" (Ѹ).  These are definitely HTML encoded in the feed and should remain HTML encoded when imported. It appears that the import process is decoding them, but there is no reason to do that.

     

    3) In the post content, inline style attributes are being removed. They were not removed in the previous site, and looking at my previous posts (anything prior to 2019-03-25) you can see a definite difference. Since SSC.com does not import a blog's style sheet (and I get that as it could easily interfere with the sites style), I really need the inline style info in order for tables to format correctly.  Please, please do not strip out the inline style attributes. Or make it an option to set in the Profile, or something.

     

    Thanks, Solomon...

     

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Here is how to reproduce the encoding issue (at least in terms of forum posts, not sure if it is the same issue that is happening with imported blog posts):

    1) Use an HTML-encoded non-ASCII character:  ख़

     

    2) Save the post.

     

    3) Edit the post. The character will now appear as the actual character in the editor; you will not see the HTML-encoded form that you started with (and is shown above -- I was only able to keep that character from encoding by HTML-encoding the ampersand as well)

     

    4) Save the post.

     

    5) You should now have a "?" in the post instead of the character it was in the editor.

     

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • I will test it myself using Ă

     

    A

     

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Ok, so I figured out what the issue is:

    Somewhere in the saving process, there is a VARCHAR() input parameter or variable being used to store the post content. Here is what is going on:

    1) Upon "Submit", page sends data encoded as UTF-8 (all is good): Ă sent as-is, all standard ASCII characters

     

    2) Page content is converted to UTF-16 / NVARCHAR / .NET string (all is good): still Ă , all standard ASCII characters

     

    3) Somewhere in here VARCHAR is used, whether as a parameter, variable, or even how it is stored in the table (I would hope not, but no way to tell at the moment). This step initially goes unnoticed because "Ă" is all standard ASCII characters.

     

    4) Click "edit", and HTML entities are decoded when pulled into the editor: what was being stored as "Ă" now becomes "Ă" in the editor.

     

    5) Upon "Submit", page sends data encoded as UTF-8 (all is good): Ă sent encoded in UTF-8

     

    6) Page content is converted to UTF-16 / NVARCHAR / .NET string (all is good): still Ă , but now encoded in UTF-16

     

    7) Somewhere in here VARCHAR is used, whether as a parameter, variable, or even how it is stored in the table. Now this step is noticed because "Ă" gets converted into "A". This tells us that a) VARCHAR is being used, and b) the database in which this is happening has a default collation that uses code page 1252 or 1250 or 850 (maybe a few other choices). I know this because "Ă" only has a best-fit mapping to "A" in those code pages, and will otherwise convert to "?" if using code pages 932, 936, 949, 1255, 1256, and maybe a few others (didn't test them all). I would guess that a Latin1_General collation is being used (hopefully in the "Latin1_General_100_xxxxxx" series and not "SQL_Latin1_General_CP1_xxxxx", though given that the default when installing is unfortunately still "SQL_Latin1_General_CP1_CI_AS", it might be that one).

     

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Thanks - we'll take a look.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply