Get file encoding using "file -b --mime-encoding"

Hello!

I’m developing an application that crunches data uploaded by users, so CSV file uploading is at its core. Whenever users upload files, file encoding comes to mind.

My first choice was to restrict things to UTF-8 and replace incompatible characters with “?”, but that would create a terrible UX. So I decided to put an encoding dropdown on the user’s settings page, but since our users don’t have an IT background, they would never pick it correctly.

So, after googling it for a while, I found out that running

file -b --mime-encoding /path/to/file

would get me its encoding. So I wrote the following function to do that in Ruby:

def get_file_encoding
  self.file.cache!
  encoding = `file -b --mime-encoding #{Rails.root}/tmp/uploads/#{self.file.cache_name}`[0..-2]
  if %w{iso-8859-1 us-ascii utf-8}.include? encoding
    encoding
  else
    "utf-8"
  end
end

Since the files are stored in Amazon S3 through Carrierwave, I first need to cache the file locally so I can run the command. So the first line caches the file, and the second gets the encoding. The if clause at the end limits the encodings I will accept (I don’t know everything that command can return; maybe it returns nothing when the encoding isn’t specified in the file) and falls back to “utf-8” if the result isn’t one of those options.

I’m not sure if that is a good solution.

  • When may that command (file --mime-encoding) not work?
  • Do I need to erase the cache file later? Since I’m using Heroku, I believe it does a better job of cleaning the tmp folder than I would, given how restrictive they are about it.
  • Is there any better solution to get the file encoding?

Thanks!

That solution looks pretty reasonable to me. I’d push it up to a staging environment and play with it for a while to make sure it’s working.

As far as your questions:

When may that command (file --mime-encoding) not work?

A brief skim of the man page yields “file returns 0 on success, and non-zero on error.” If you’re concerned about file failing, maybe you should raise an exception if it returns non-zero.
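A minimal sketch of that exit-status check, assuming the `file` utility is on the PATH. The names `run_checked`, `detect_encoding`, and `CommandFailed` are made up for illustration. Passing the arguments as an array also avoids interpolating the upload path into a shell string.

```ruby
require "open3"

# Hypothetical error class for this sketch.
class CommandFailed < StandardError; end

# Runs a command without a shell (arguments are passed as separate strings,
# so spaces or quotes in a path are not interpreted as shell syntax).
# Raises CommandFailed when the command exits with a non-zero status.
def run_checked(*cmd)
  output, status = Open3.capture2e(*cmd)
  unless status.success?
    raise CommandFailed, "#{cmd.first} exited with #{status.exitstatus}: #{output.strip}"
  end
  output.strip
end

# Asks `file` for the encoding of the file at `path`.
def detect_encoding(path)
  run_checked("file", "-b", "--mime-encoding", path)
end
```

The whitelist from the original function is still worth keeping on top of this, since a zero exit status does not guarantee that the output is an encoding name you want to accept.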

Do I need to erase the cache file later?

Heroku’s docs indicate that they wipe out stuff in ./tmp. But it wouldn’t hurt to clear the cache yourself if that method is the only one that needs it. I might have an ensure block do that.
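Roughly what I mean by an ensure block, as a sketch. It uses a plain temporary copy as a stand-in for Carrierwave’s cache, so `with_local_copy` is a hypothetical helper, not a Carrierwave API.

```ruby
require "fileutils"
require "tmpdir"

# Copies `source_path` into a fresh temporary directory, yields the path of
# the local copy, and removes the directory afterwards, even if the block raises.
def with_local_copy(source_path)
  dir = Dir.mktmpdir("upload-cache")
  local_path = File.join(dir, File.basename(source_path))
  FileUtils.cp(source_path, local_path)
  yield local_path
ensure
  # `dir` is nil if mktmpdir itself failed, so guard before removing.
  FileUtils.remove_entry(dir) if dir && File.exist?(dir)
end
```

The method returns whatever the block returns, so the encoding detection can run inside the block and the cleanup happens on the way out.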

Is there any better solution to get the file encoding?

So I just googled, and it looks like you can do this in Ruby:

File.read('foo.csv').encoding # => #<Encoding:UTF-8>

Hey Ben, thanks for your answer!

I’m thinking of converting all files to UTF-8 before Carrierwave stores them on Amazon S3, using a Carrierwave processing action. That way I wouldn’t have to call cache, and Carrierwave would deal with that properly. I would also have UTF-8 as the standard in my storage.
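The conversion step itself could be sketched like this. `to_utf8` is a hypothetical helper, it is not wired into any Carrierwave callback here, and it assumes the source encoding has already been detected.

```ruby
# Re-labels the raw bytes with the detected encoding, then transcodes to UTF-8.
# Bytes that cannot be converted are replaced with "?" instead of raising.
def to_utf8(bytes, source_encoding)
  bytes.dup
       .force_encoding(source_encoding)
       .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
       .scrub("?") # also covers input that was already labelled UTF-8 but contains invalid bytes
end
```

For example, `to_utf8("caf\xE9".b, "iso-8859-1")` yields the UTF-8 string "café".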

About the encoding method you cited, that was my first try, but it always returns Encoding:UTF-8.
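As far as I can tell, that is because `File.read` does not inspect the bytes at all: it only labels the string with the encoding it was told to use (the default external encoding when none is given). The small sketch below shows this with a Latin-1 file; `describe_read` is a name made up for the demonstration.

```ruby
require "tempfile"

# Reads the file as UTF-8 and reports the label Ruby attached to the string,
# plus whether the bytes are actually valid under that label.
def describe_read(path)
  content = File.read(path, encoding: "UTF-8")
  { label: content.encoding, valid: content.valid_encoding? }
end

Tempfile.create("latin1") do |f|
  f.binmode
  f.write("caf\xE9".b) # "café" encoded as ISO-8859-1
  f.flush
  p describe_read(f.path)
end
```

The label comes back as UTF-8 even though the bytes are not valid UTF-8, so `valid_encoding?` is the more useful check if you only need to know whether a file is acceptable as UTF-8.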

Thanks!

My pleasure. Glad I could help!