Hello!
I’m developing an application that crunches data users upload, so CSV file uploading is at its core. Whenever users upload files, file encoding is always a concern for me.
My first idea was to restrict everything to UTF-8 and replace incompatible characters with “?”, but that would create a terrible UX. Then I thought about putting an encoding dropdown on the user’s settings page, but since the users aren’t from an IT background, they would never pick the right option.
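(Just for context, the rejected “force everything to UTF-8” idea would have been roughly the sketch below; uploaded_path is a placeholder for wherever the raw upload would end up, not something from my actual code.)

# Rough sketch of the rejected idea: read the upload as raw bytes and
# transcode to UTF-8, replacing anything that doesn't map cleanly with "?".
# `uploaded_path` is a hypothetical placeholder, not part of my real code.
raw  = File.read(uploaded_path, mode: "rb")
utf8 = raw.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")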
So, after Googling around for a while, I found out that running
file -b --mime-encoding /path/to/file
would get me its encoding. So I wrote the following function to do that in Ruby:
def get_file_encoding
  # Cache the S3-backed CarrierWave file locally so the shell command can read it
  self.file.cache!
  # `file -b --mime-encoding` prints just the encoding; [0..-2] drops the trailing newline
  encoding = `file -b --mime-encoding #{Rails.root}/tmp/uploads/#{self.file.cache_name}`[0..-2]
  # Only accept encodings I know how to handle; fall back to UTF-8 otherwise
  if %w{iso-8859-1 us-ascii utf-8}.include? encoding
    encoding
  else
    "utf-8"
  end
end
Since the files are stored in Amazon S3 through CarrierWave, I first need to cache the file locally so I can run the command. So the first line caches the file and the second gets the encoding. The if clause at the end limits the encodings I will accept (since I don’t know everything that command can return; maybe it returns nothing when it can’t detect the file’s encoding), and falls back to “utf-8” if the result isn’t one of those options.
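In case the exact invocation matters for the review, a roughly equivalent version that builds the path with Rails.root.join and shell-escapes it using Ruby’s standard Shellwords would look something like this (just a sketch, assuming the cache path is the same one used above):

require "shellwords"

def get_file_encoding
  self.file.cache!
  path = Rails.root.join("tmp", "uploads", self.file.cache_name).to_s
  # Escape the path in case the cached filename contains spaces or shell metacharacters
  encoding = `file -b --mime-encoding #{Shellwords.escape(path)}`.chomp
  %w{iso-8859-1 us-ascii utf-8}.include?(encoding) ? encoding : "utf-8"
end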
I’m not sure if that is a good solution.
- When might that command (file --mime-encoding) not work?
- Do I need to delete the cached file afterwards? Since I’m using Heroku, I believe it cleans the tmp folder better than I would, since they’re so restrictive about it.
- Is there a better way to get the file encoding?
Thanks!