Importing data from CSVs - Tips and Tricks for Rubyists

CSVs are great for importing and exporting data but often have quirks relating to formatting. This article is a guide on importing data and the issues I've run into in my own experiences.

It's all data? Always has been!
CSVs are really useful for importing large sets of data. We regularly use them for various tasks at OOZOU. Ruby provides a really great CSV library for both processing and creating CSV files. I'm not going to go over what a comma separated values file is here. I assume if you're reading this you probably already know. If not read the wiki link above.

There are a couple of formatting mistakes that we often see here at OOZOU. Fairly often clients will send us CSV files that have been generated in Microsoft Office and have formatting issues like empty lines or values that are not correctly escaped e.g.

name,description,product_code
Bookcase,This bookcase is unique, stylish, and great value for money,CD0018394
Computer Desk,This desk is 50"x180" and made from pine, aluminum, and plastic,CD0190049

There are two examples of corrupt data. Line 2 has too many commas so our processor won't know which field maps to which value. We can solve this by wrapping the value in quotes. However, our 3rd line has quotes and we need to double quote those to escape them, e.g.

name,description,product_code
Bookcase,"This bookcase is unique, stylish, and great value for money",CD0018394
Computer Desk,"This desk is 50""x180"" and made from pine, aluminum, and plastic",CD0190049

You can also specify a separator and reserve commas for values instead. This raises one of the first considerations we need to agree on when dealing with CSVs. 

Ensure your CSV files have a static format


CSV headings are optional. We can tell our processor what we're expecting or we can tell the processor that the first line contains the headings.

CSV.parse('path/to/csv.csv', headers: true)
CSV.parse('path/to/csv.csv', headers: :first_row)

The above code will find the headers from the first line, so we expect a file like

name,description,product_code
Bookcase,"This bookcase is unique, stylish, and great value for money",CD0018394
Computer Desk,"This desk is 50""x180"" and made from pine, aluminum, and plastic",CD0190049

However, if our CSV provider decides to drop the headers our parser will have weird heading names that our scripts won't understand. We’ve also had an issue where our CSV provider changed the headers slightly so our scripts didn’t recognize the values.

We can manage this another way, too.

CSV.parse('path/to/csv.csv', headers: [:name, :description, :product_code])

Now we've told the parser that the file doesn't have headers but we are expecting the values to be in the following order. Now we have the opposite problem that a file with headers will now have an invalid value at the beginning. We also have an issue that ordering is important now. If our CSV provider suddenly starts providing the name in the last slot, our script fails yet again.

These issues can be easily fixed. The solution is simple -  agree on a format with the CSV provider. In theory this is simple and logical. However, I've had more than one occasion where the format changed without any notification, so it's always worth doing some validation on the expected values and raising errors if they are invalid.
Now everyone has emails for names... in production!

Format your headers and values


When parsing CSVs we can tell ruby to do processing to both the headers and values. If I can request specific header names I also like to pass the header_converters option to convert the string keys to symbols which makes fetching values more ruby like.

csv = CSV.parse('path/to/csv.csv', headers: true, header_converters: :symbol, converters: [:integer, :float, :date_time]).to_h
csv.each do |row|
  row[:name]
end

These converters will check the values to see if they can be converted to integers, floats, or date_times. There's a couple of other built in converters, too.

However, a lot of the time you want to do more advanced converting. You can create a custom converter and convert each field differently quite easily:

my_converter = -> { |value, field|
  case field.header
  when :name
    value.to_s.capitalize
  when :contact_number
    PhonyRails.normalize_number(value, country_code: 'TH')
  when :created_at_utc
    Time.zone.parse(value)
  else
    fail("Unknown field name #{field.inspect}=#{value}")
  end
}

csv = CSV.parse('path/to/csv.csv', headers: true, header_converters: :symbol, converters: [my_converter]).to_h

Log everything


When importing data it's important to log as much as you can because you'll probably forget to log some important piece of information that is only obvious in hindsight. 
Oops, I just sent out 5000 twilio messages and I don't know who I sent them too!


I typically run my import scripts as either rake tasks or through rails runners. Create a logger that logs to a file somewhere on your system. Include the date in the name! Including the date prevents you from accidentally overwriting the file if you have to run your script more than once. It also gives you a reference for when the script was actually run.

If your script is going to log gigabytes of data you might want to think about having multiple log files for different steps in the script or setup logrotate to rotate the file hourly.

Make sure your scripts are idempotent


A great way of doing this is to first import your CSV data into an intermediary table and then process those rows in a separate task. This allows you to bulk import with gems like activerecord-import and then each row in the table can have a processed boolean field to record if it was processed or not. You can also create a db table to record steps taken on each row similar to how gems like statesman have transition associations to record transitions with metadata about each action.

I said idempotent not omnipotent

Show progress or go nuts


If you're importing a 100’s of thousands of rows it can take hours. It's really nice to be able to get a progress report. If you can't see the progress you might wonder if it's still running and consider quitting the script - I know from experience. The best real time solution I came up with was a progress bar in the terminal. You can use tools like tty-progressbar to show progress in a variety of ways. It can even predict how long your script will take to complete based on the current rate at which rows are being processed. 

Need to import data into your application or service?


We're experts here at OOZOU. We've been importing data from APIs, FTP servers, and even emails for years. If you are looking for a team capable of building a robust solution for you, get in touch below

Ready to start your project? Contact Us

Like 2 likes
Joe Woodward
I'm Joe Woodward, a Ruby on Rails fanatic working with OOZOU in Bangkok, Thailand. I love Web Development, Software Design, Hardware Hacking.
Share:

Join the conversation

This will be shown public
All comments are moderated

Get our stories delivered

From us to your inbox weekly.