2019-11-17

PowerShell script to convert UTF-8 misinterpreted file names

You'd think that somebody else would have come up with a quick script to do just that on Windows, but it looks like nobody else bothered, so here goes.

Here's the deal: You copied a bunch of files, and somewhere along the way, one of the applications screwed up. Instead of producing actual Unicode file names, it misinterpreted the UTF-8 byte sequences as code page 1252, resulting in something dreadful like this:
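The mangling is easy to reproduce in PowerShell itself. What follows is only a minimal sketch, using a hypothetical file name (é.txt) that is not one of the files from this post. It assumes that code page 1252 is available through .NET, as it is in Windows PowerShell.

```powershell
$cp1252 = [System.Text.Encoding]::GetEncoding(1252)
$utf8   = [System.Text.Encoding]::UTF8

# Hypothetical file name "é.txt", built from a code point so that the
# example does not depend on how this snippet itself is saved
$original = [string][char]0x00E9 + ".txt"

# What the faulty application did: take the UTF-8 bytes (C3 A9 for "é")
# and decode each byte as a CP 1252 character
$mangled = $cp1252.GetString($utf8.GetBytes($original))

# The repair: encode as CP 1252 to get the original bytes back,
# then decode those bytes as UTF-8
$repaired = $utf8.GetString($cp1252.GetBytes($mangled))

Write-Host "$original -> $mangled -> $repaired"
```

The single character é turns into the two characters Ã©, which is why mangled names are always longer than the originals.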


And now you'd like to have a quick way to convert the 1252-interpreted UTF-8 back to actual UTF-8. So you look around thinking that, surely, someone must have done something to sort out this annoyance, but the only thing you can find is a Unix Perl script called convmv, which isn't really helpful on Windows. Why hasn't anyone crafted a quick PowerShell script to do the same on Windows already?

Well, it turns out that, because of PowerShell's limitations, and Windows getting in the way of a proper conversion from 1252 to UTF-8, producing such a script is actually a minor pain in the ass. Still, someone has now produced such a thing:
#region Parameters
param(
 # (Optional) The directory
 [string]$Dir = "."
)
#endregion

# You'll need to have your console set to CP 65001 AND use NSimSun as your
# font if you want any hope of displaying CJK characters in your console...
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8

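# -Name returns each path as a string relative to $Dir
$files = Get-ChildItem -File -Path $Dir -Recurse -Name

foreach ($f in $files) {
  # Encode the name as CP 1252 to recover the original UTF-8 bytes,
  # then decode those bytes as UTF-8 and keep only the file name part
  $bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes($f)
  $nf = [io.path]::GetFileName([System.Text.Encoding]::UTF8.GetString($bytes))
  # Skip names that come out unchanged (e.g. plain ASCII): nothing to rename
  if ($nf -ceq [io.path]::GetFileName($f)) { continue }
  Write-Host "$f" → "$nf"
  # Must use -LiteralPath else files that contain '[' or ']' in their name produce an error.
  # Join with $Dir so that the path also resolves when -Dir is not the current directory.
  Rename-Item -LiteralPath (Join-Path $Dir $f) -NewName "$nf"
}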

# Produce a "Press any key" message when run with right click
$auxRegKey='\SOFTWARE\Classes\Microsoft.PowerShellScript.1\Shell\0\Command'
$auxRegVal=(get-itemproperty -literalpath HKLM:$auxRegKey).'(default)'
$auxRegCmd=$auxRegVal.Split(' ',3)[2].Replace('%1', $MyInvocation.MyCommand.Definition)
if ("`"$($myinvocation.Line)`"" -eq $auxRegCmd) {
  Write-Host "`nPress any key to exit..."
  $null = $Host.UI.RawUI.ReadKey('NoEcho,IncludeKeyDown')
}

If you save this script as something like utf8_rename.ps1 in the top directory where you have your misconverted files, you can then use Run with PowerShell in Explorer's context menu. Provided your console is set to code page 65001 (a.k.a. UTF-8) and you select a font that actually supports CJK characters, such as NSimSun (Microsoft will really have to explain how they have no trouble displaying CJK with NSimSun but still can't, or won't, do it with Lucida Console), you should then see some output like this:


In the end, your file names should have been converted to their expected values, and all will be well:
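One thing to keep in mind is that the script assumes every name it finds really is mis-decoded UTF-8. If your directory is a mix of mangled and correct names, a stricter check may be worth having. The following is only a sketch of one way to do that, not part of the script above: Repair-Name is a hypothetical helper that uses the exception fallbacks from .NET so that a name which cannot be round-tripped yields $null instead of a garbled result.

```powershell
# Hypothetical helper: returns the repaired name, or $null when the
# name is not valid "UTF-8 read as CP 1252"
function Repair-Name([string]$Name) {
  # Strict CP 1252: throw on characters that CP 1252 cannot represent
  $strict1252 = [System.Text.Encoding]::GetEncoding(1252,
    [System.Text.EncoderFallback]::ExceptionFallback,
    [System.Text.DecoderFallback]::ExceptionFallback)
  # Strict UTF-8: no BOM, throw on invalid byte sequences
  $strictUtf8 = New-Object System.Text.UTF8Encoding($false, $true)
  try {
    return $strictUtf8.GetString($strict1252.GetBytes($Name))
  } catch {
    return $null
  }
}

# Hypothetical sample names, built from code points
$mangled = [string][char]0x00C3 + [char]0x00A9 + ".txt"   # "Ã©.txt"
$latin   = "caf" + [char]0x00E9 + ".txt"                   # "café.txt"
$cjk     = [string][char]0x65E5 + ".txt"                   # "日.txt"

foreach ($n in $mangled, $latin, $cjk, "plain.txt") {
  $r = Repair-Name $n
  if ($null -eq $r) { Write-Host "$n : left alone" }
  else              { Write-Host "$n : $r" }
}
```

With this check, a name that is already correct, whether genuine Latin-1 text or real CJK characters, is reported and left alone, whereas the plain loop would try to rename it to something containing replacement characters or question marks.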



That is, until someone who thinks it's okay to not properly support UTF-8 absolutely EVERYWHERE (Hey Microsoft, how about some UTF-8 Win32 APIs already?) screws up and forces people to manually unscrew their codepage handling yet again...

Bonus

By the way, if you're using Windows 10 19H1 or later, you should know that Microsoft finally added a setting to set the system codepage to UTF-8, which seems to improve on the failed codepage conversions that prompted the above script. Even though it is labelled as Beta, you may want to enable it: