Get file encoding using "file -b --mime-encoding"

jdanielnd · May 21, 2013, 2:06pm

Hello!

I’m developing an application that crunches data that users upload, so CSV file uploading is at its core. Whenever users upload files, file encoding comes to mind.

My first choice was to restrict things to UTF-8 and replace incompatible characters with “?”, but that would make for a terrible UX. I then considered putting a dropdown on the user’s settings page, but since our users don’t come from an IT background, they would never pick the right encoding.

So, after googling it for a while, I found out that running

file -b --mime-encoding /path/to/file

would get me its encoding. So I wrote the following function to do that in Ruby:

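# Returns the encoding reported by `file`, limited to a known whitelist.
def get_file_encoding
  self.file.cache!
  encoding = `file -b --mime-encoding #{Rails.root}/tmp/uploads/#{self.file.cache_name}`.strip
  # Anything outside the whitelist (including empty output) falls back to UTF-8.
  if %w{iso-8859-1 us-ascii utf-8}.include? encoding
    encoding
  else
    "utf-8"
  end
end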

Since the files are stored in Amazon S3 through Carrierwave, I first need to cache the file locally so I can run the command. The first line caches the file, and the second gets the encoding. The if clause at the end limits the encodings I will accept (I don’t know everything that command can return; maybe it returns nothing when the encoding isn’t specified in the file) and returns “utf-8” if the result is not one of those options.

I’m not sure if that is a good solution.

When may that command (file --mime-encoding) not work?
Do I need to erase the cache file later? Since I’m using Heroku, I believe it does the job of cleaning the tmp folder better than I would, since they’re so restrictive about it.
Is there any better solution to get the file encoding?

Thanks!

r00k · May 21, 2013, 4:01pm

That solution looks pretty reasonable to me. I’d push it up to a staging environment and play with it for a while to make sure it’s working.

As far as your questions:

When may that command (file --mime-encoding) not work?

A brief skim of the man page yields “file returns 0 on success, and non-zero on error.” If you’re concerned about file failing, maybe you should raise an exception if it returns non-zero.
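
To make the exit-status suggestion concrete, here is a minimal sketch that runs the command through Open3 and raises when it exits non-zero. The names detect_encoding and EncodingDetectionError, and the overridable command: keyword, are made up for illustration and are not part of the original code.

```ruby
require "open3"

# Hypothetical error class for a failed `file` invocation.
class EncodingDetectionError < StandardError; end

# Runs `file -b --mime-encoding <path>` and returns its trimmed output.
# The arguments are passed as a list, so the path never goes through a shell.
# `command:` is an illustrative hook that lets the command be swapped out.
def detect_encoding(path, command: ["file", "-b", "--mime-encoding"])
  output, status = Open3.capture2(*command, path)
  unless status.success?
    raise EncodingDetectionError,
          "#{command.first} exited with status #{status.exitstatus}"
  end
  output.strip
end
```

Checking the exit status alone may not catch everything: as far as I know, some versions of file report problems such as a missing file in their normal output, so keeping the whitelist of accepted encodings still seems worthwhile.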

Do I need to erase the cache file later?

Heroku’s docs indicate that they wipe out stuff in ./tmp. But it wouldn’t hurt to clear the cache yourself if that method is the only one that needs it. I might have an ensure block do that.
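
The ensure idea could look roughly like the sketch below. It assumes you already know the path of the locally cached copy; with_cleanup is a made-up helper name, and it uses plain FileUtils rather than any Carrierwave call.

```ruby
require "fileutils"

# Yields the path to the block and removes the file afterwards,
# whether the block returns normally or raises.
def with_cleanup(path)
  yield path
ensure
  # rm_f does not complain if the file is already gone.
  FileUtils.rm_f(path)
end
```

Inside get_file_encoding, the call to file would go in the block, so the cached copy disappears even when detection fails.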

Is there any better solution to get the file encoding?

So I just googled, and it looks like you can do this in Ruby:

File.read('foo.csv').encoding # => #<Encoding:UTF-8>

jdanielnd · May 21, 2013, 7:08pm

Hey Ben, thanks for your answer!

I’m thinking of converting all files to UTF-8 before Carrierwave stores them on Amazon S3, using a Carrierwave processing action. That way I wouldn’t have to call cache, and Carrierwave would deal with that properly. I would also have UTF-8 as the standard in my storage.

About the encoding method you mentioned: that was my first try, but it always returns Encoding:UTF-8.
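
That result is expected: File.read does not inspect the bytes, it only labels the string with Ruby's default external encoding, which is UTF-8 in a typical Rails setup. What pure Ruby can do is check whether the bytes are valid in a given encoding. The sketch below does that for UTF-8; utf8_file? is a made-up helper name.

```ruby
require "tempfile"

# True when the file's raw bytes form valid UTF-8.
# This validates the bytes; it does not detect which encoding was used.
def utf8_file?(path)
  File.binread(path).force_encoding("UTF-8").valid_encoding?
end

Tempfile.create("sample") do |f|
  f.binmode
  f.write("caf\xE9".b) # "café" in ISO-8859-1, which is not valid UTF-8
  f.flush
  puts File.read(f.path).encoding # the default external encoding, not a detected one
  puts utf8_file?(f.path)
end
```

A file that fails this check still has to be identified some other way, which is where the file command comes in.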

Thanks!

r00k · May 22, 2013, 3:08pm

My pleasure. Glad I could help!
