I usually write more about marketing/sales than I do about actually making software products, but I have recently been working on the product side a little more intensively for Appointment Reminder. One of the features we implemented was CSV upload. This is a very, very common task for virtually every B2B SaaS product, so I thought I’d share how we did it, as I’m pretty happy with how it is working out. Hopefully it will be useful to some of you.
The Problem With Upload Interfaces
Substantially every B2B SaaS product benefits from interoperability with other recordkeeping systems your customers use, including “formal” software (your competitors or products you interoperate with) and “informal” software like e.g. spreadsheets, Trello lists, and email-inboxes-used-as-a-database.
This is particularly true early in your customer’s lifecycle with your company: most data which will be new to your system is not actually new. It presently exists somewhere in the customer’s organization. You presumably want to make it as easy as possible for them to put it into your system and make your system the “source of truth” about that data.
Frequently, savvy SaaS entrepreneurs do this with “concierge onboarding” — basically, using high-touch human handholding to substitute for features which do arbitrary data source to your DB schema importing. Why? Partially because the human touch earns a lot of customer loyalty. Partially because you can offer concierge onboarding as soon as you have one smart person who has an inbox and free time, without necessarily having to build a whole lot of software support for them.
Why I Punted On CSV Import for 4 Years
My launch list of features for Appointment Reminder included CSV import for client data (names/phone numbers/emails) and for appointment data (client data plus date/times of appointments). Unfortunately, I assessed this feature as probably costing $100,000 in engineering time to implement well, so I punted and implemented a quick “concierge upload feature.”
This was backed by me literally SCPing the files to the server, then opening the Rails console, and typing commands to parse out the file format in real time. It would typically look something like:
lines = File.readlines("/home/patrick/new-upload.csv")
u = User.find_by_email "someone@example.com"
a = u.account
records = lines.map {|line| line = line.split; record = {}; record[:name] = line[3]; record[:email] = line[7]; record[:home_phone] = record[:mobile_phone] = line[12]; record}; records.size
records.last #Check to make sure that command actually worked.
clients = records.map {|record| c = a.clients.build; c.name = record[:name]; c.home_phone = "blah blah blah"; c}; clients.size
clients = clients[1..(clients.size - 1)] #Skip the header line
clients.map {|c| c.save}; "Clients saved!"
errors = clients.select {|c| c.errors.present?}; errors.size
#While there are still records with errors
c = errors.pop
puts c.inspect
#c.email is "bob@gmail", needs a .com
c.email += ".com"
c.save
#Worked!
c = errors.pop
#Blah blah blah.
For a well-behaved CSV file, this took about 5 to 15 minutes of work. For clients who had very unclean CSV files, I often ended up doing an hour of data entry into IRB. That certainly makes sense for clients with predicted LTVs of $5,000+, but ideally I wouldn’t be doing this sort of work when I could be doing other stuff to drive the business forward. Additionally, the latency in sending me an email and actually having one’s account ready was 24~48 hours in the best of times, and that often caused clients to seek alternatives.
So why didn’t I just have a generic CSV importer ready? Because it is hellaciously difficult to do CSV import well in the general case.
Why is this?
- Column-to-column mapping: Clients’ existing understanding of their data very rarely matches with your understanding of how the same data is organized, so you have to map their columns to your columns. This is often not a one-to-one mapping. For example, Appointment Reminder gives each client a single freeform name field, but many customers have systems which differentiate between first and family names (presumably because they think those things exist), so you have to have a way to stitch together those columns.
- Pervasively unclean data: Informal software often includes data which is not exactly machine readable. Email addresses like bob@aol. Phone numbers like [(555) 555-5555 or -5556 (his mother)]. Names like [” Catherine] which make sense if you’re a human seeing it under [Smith James] but make no sense to a CSV file reader (and, incidentally, will frequently choke it).
- File formats: You would think the CSV file format is fairly well-standardized. Hah. Hah. Hah. cries
As a result, the typical web upload workflow is very inadequate for CSV file upload. You’d like to just give them an upload control, grab the file, pass it to a background process, then update the UI with the result, but often the result is going to be “Lines 37, 45, and 392 had problems. Also, although you wouldn’t have expected it, we left off all the email addresses because you picked the wrong column. Whoopsie! Edit the file then re-upload.”
This is enormously frustrating for the user.
Enter SheetJS and Handsontable
I can’t remember how I stumbled upon SheetJS, but as soon as I saw their tech demo, I knew AR was finally going to get file uploads.
SheetJS is an OSS project which parses a variety of common spreadsheet formats, including (through libraries) Excel, OpenDoc, and CSV. Because it is Javascript, you can actually do this directly in the browser, which allows me to completely eliminate the “upload” stage of the “Upload your CSV/etc, we parse it, then we display feedback” workflow.
I can’t express how magical this is in mere words. You have to see it in action. Either use that tech demo or watch the following ~50 second video.
I’ve described this to my friends as “It’s like embedding Google Sheets into your application, for less than a day of work.”
Carefully Considered Glue Code
I had “minimal” import capabilities available in about 3 hours of work with SheetJS, but getting a decent editing workflow with Handsontable (the grid component which provides a lot of the “magic”) took a solid three days of work.
Why?
The above example shows a well-behaved Excel file. We heuristically guess what columns in the Excel file map to which columns in our data model. Files which are a) parsed and mapped 100% successfully and b) contain no errors at all… are not the most common case with CSV uploads.
I had to build a way for users to both a) quickly understand that the column mapping was not correct (ideally, without having them have to understand the phrase “heuristics used to parse your document”, since most are office managers) and b) correct it, without requiring a heck of a lot of explanation.
UX-wise, I figured that, as long as we were using the typical right-click context menu that people would be familiar with from Excel, we’d extend that with the ability to mark columns appropriately. Additionally, we’d add a bit of external-to-the-grid feedback to the user to make it obvious what their file was missing.
Then there’s the question of “What do we do about errors?”
I figured that, for AR’s particular use case, errors in a given line shouldn’t block processing of the other lines, since customer records typically don’t “bleed” into each other. Accordingly, we’d save all records which had no errors (similar to my IRB session above), then ask the user to bulk update records with errors and try again.
This is handy, since customers often have very consistent forms of errors in their documents, and the remedy for them (e.g. deleting an entire column of bad data) is often fairly easy to do in a full-featured spreadsheet interface. Putting that in the browser rather than back in Excel makes it minimally difficult for them to actually succeed in doing this.
Take a look at this 65 second video for a user who uploads 98 users successfully while also having two records with errors. (Don’t worry about violating anyone’s privacy: all of the data is faked using GenerateData.com, a really handy service for cranking out CSVs/Excel files/etc which exercise various features in upload features.)
I rush through the process pretty quickly for the purpose of the demonstration, but in real life, it is close to this easy, particularly for a user who has done this before (like, e.g., my support reps or myself — see below).
What was tough about the glue code? Mostly, that it involves a lot of jQuery callback hell. Javascript development is, in general, not exactly my strong suit. 80%+ of development time for this feature was making sure that right clicking on the mobile column to mark it as mobile actually worked as expected. You won’t believe how many reloads I did during this process.
I give an A+ to Handsontable for documentation quality and a well-designed JS API. This project would have been enormously frustrating without that. SheetJS is a little newer of a project with a substantially broader brief, so it involved a little more source code spelunking to get working (particularly when integrating new file formats into the provided dropsheet.js magic which imports the dropped data into Handsontable), but I’d rate it as production-quality. Both sets of devs are quite responsive and substantially every B2B SaaS product should adopt them (hopefully adding a bit more polish on the UX than I did).
What Does The Backend Look Like?
Customers routinely upload thousands of records at once (which, n.b., neither of these OSS projects have any problem with). Given that my Rails app treats each row in isolation rather than using a batch insert statement, mostly to run validations, this means that doing the upload parsing in the request/response cycle is probably not optimal. You never want a request waiting several seconds unless you’re intentionally doing long polling.
I instead immediately stuff the uploaded data into a semi-ephemeral data store (Redis), then acknowledge to the AJAX request “OK, got it. ID number was $FOO.” The client then begins polling an endpoint for status updates about $FOO.
(Worth mentioning: why Redis as opposed to e.g. just persisting the files to disk? One, our Redis database is encrypted and our standard file storage is not, so not putting the files on disk saves us from having to worry about whether they include HIPAA-privileged data at all. Two, while we could theoretically clear out uploads on a regular basis, Redis has the expiry logic already built into it, so the data has built-in ephemerality. Three, it’s easy for my app to assert “We’ll always have a Redis instance available!” but not quite so easy to assert “We’ll always have a local filesystem shared by all processes relevant to servicing a particular user’s account” — if we eventually move to a multi-machine architecture then the file system suddenly becomes a really bad place to be storing uploads. Four, uploaded files are inherently a security risk if a misconfiguration lets someone execute upload.csv as e.g. a PHP file, and I’m decently certain there is no URL which will ever route to an arbitrary key in Redis and cause it to be evaluated as anything other than a string.)
Our queue worker processes then process upload $FOO, updating the database directly for working records and saving errored records as a new job (also in Redis). We then have the client poll result return both instructions to the user and new records to repopulate the table with. While our upload infrastructure at the model level is reasonably well architected with separations of concerns and everything, the bit which synchronizes the jobs and the UI is incredibly tightly coupled between the UI and the controller layer. That decision will eventually make maintenance of this more painful than it would be if they were decoupled. I shipped it anyhow, because shipping beats not shipping.
To give you a rough idea of complexity:
upload_processor.rb contains the logic orchestrating jobs, persisting to/from Redis, and turning JSON blobs which were POSTed to the server into client records. It is 335 lines long, and required approximately 20 lines of additions to our Client class to support. Unit tests run another 500 or so lines.
uploads_controller.rb turns AJAX requests into upload processor method calls. It is 135 lines, of which 35 are the incredibly coupled method for answering poll requests. I’ll avoid showing a screenshot for fear of being blocked as obscenity in the offices of good engineering teams everywhere.
The Javascript glue code is 450 lines long. No unit tests because, frankly, I have no clue how one would test this in an automated fashion. I’d rate it as C+ in terms of code quality — jQuery callback hell, what can I say.
Rolling Out To Customers
I don’t like exposing new features directly to customers, particularly features which I suspect are probably fiddly. For example, while I’ve tested this feature with documents in a variety of encodings and file formats, I have no clue what will happen if someone uses an outdated version of MS Excel on a Windows 98 machine to upload a file written in a Hebrew code page, and I’m honestly a little scared of finding out.
Accordingly, we put this behind a feature flag. Feature flags are structurally similar to A/B tests:
#In a view. @account is, by convention, the currently logged in account.
#Account#allows?("feature") is our feature flag method.
<% if @account.allows? "uploads" %>
Click here to <%= link_to "upload", {...} %> your client data.
<% end %>
Naturally, we also control access at the controller level (to prevent anyone from playing guess-the-URL), but this demonstrates the basic idea. We can assign users access to this feature individually or in groups, without having to roll it out to the entire userbase.
At present, client data uploads are available in production for our internal users only. Our administrator accounts can upload data into any account, but users can’t access the corresponding upload feature for their own accounts unless I hand-grant their account that permission.
This forces clients to continue emailing us their CSV files to get them uploaded, which lets me have the catastrophic, terrible UX, blow-up-straight-in-my-face errors that still exist in this feature happen to a patient and incredibly invested product owner (me) rather than an impatient user on their first day. For example, it wasn’t obvious to me, but if JSON interpreted someone’s phone number as an integer (5555555555) rather than a string (555 555-5555), our upload job would die with an uncaught exception and the polling method would continue polling forever. I fixed that and the user never even knows it happened.
This also lets us verify that the parsing/heuristics/etc work for someone’s particular “informal software” used at their business. Typically, they have one or a handful of Excel files, often maintained on a single version of Excel on a small set of machines. We can handle the first upload ourselves, verify that it works as expected, and then extend them the ability to do uploads. It is unlikely that they’ll add row-level data which comprehensively borks the upload feature assuming it works on their existing files at least once.
What’s Next In Our App
Now that we have the ability to do this for client data uploads, which are (by far) the easier of the two upload types for us, we’re going to deal with appointment data uploads next. These are a) less forgiving of error than client data, b) conceptually harder to deal with (parsing date/times… ick), and c) vastly more numerous than client data, since client data is typically essentially append-only with minimal editing after upload but appointment data changes on a multiple-times-a-day basis.
Luckily, we’ll be able to reuse a bit of the infrastructure which we built and also will, hopefully, have shaken out a lot of the UX/error handling/monitoring/etc challenges prior to playing the upload game on hard mode.
Presently, we have a single set of heuristics/parsers/etc which are used for all clients. Eventually, I’d like this to be plug-and-play on a per-customer basis, so that we could write ~50 lines of Ruby if required to slurp in a godawful file format required to support a particular enterprise client. c.f. the Strategy pattern in the GoF book, except less painful, because metaprogramming makes this much less painful than it would otherwise be.
A Brief Meditation on OSS
I estimated this feature as costing $100k in engineering time if I were scratch building it. We got it done in +/- 3.5 days of work, which is easily a 90%+ savings, as a result of SheetJS and Handsontable being available. I simply didn’t feel right getting that amount of value for free from two projects which are run by very small teams, so I approached both and convinced them to sell me an enterprise license to their project. It is equivalent to the usual OSS license, except it comes with an invoice. (I don’t think it would be appropriate to name numbers, as that might constrain their ability to price enterprise licenses going forward. Let’s say that my initial outreach emails — titled “Can I pay you to work on $PROJECT?” — probably sounded like a less-wealthy-than-usual Nigerian prince who happened to live in Tokyo and have a really odd interest in CSV files.)
If you have an OSS project which is as useful for for-profit businesses as these two projects, I would strongly, strongly advise taking down any mention of “donations” and offer commercial licensing for the project. This doesn’t have to change anything about your project. (What is the difference between a commercial license that is completely discretionary and a donation? I can donate money, and like most middle class people, my actual behavior with regards to donations is “occasionally donates tens of dollars to poor people and other deserving causes.” My company is literally not allowed by the tax code to donate money, but it can buy any software it feels like, it does not require “You must be poor” to write a check to you, and it assumes that software costs hundreds/thousands of dollars. Ask for license revenue, not for donations.)
If you’re willing to let commercial viability influence your choice of licenses, the dual licensing model also works pretty well, which my buddies at Binpress explain in more detail here. (Disclaimer: I’m a small investor in Binpress.) For example, Appointment Reminder would be totally unwilling to put GPLed code anywhere in our codebase (viral infection of the rest of the code is a non-starter for me), but if you (the copyright holder) said “We’ll let you use our code on a basis equivalent to an MIT license, if you pay us $X,000″, then (assuming the code solved a $X,000 problem for me) I’d write a check immediately.
More broadly, I think that SaaS businesses and other heavy consumers of OSS owe it to the community to provide the funding for projects which represent significant advances, as we are — frankly — much better than the typical OSS dev at actually monetizing software. I’m less worried about the likes of Ubuntu or Chrome, which have massive corporate backing behind them, and even much smaller projects like Rails do pretty well with a few larger sponsors (Basecamp, NewRelic, Heroku, etc) which full-time employ the largest contributors. I think it is right and proper for for-profit businesses to assist the labors-of-love OSS projects at the lower end of the scale though in going full-time on those projects if that meets their goals. So, where possible, I try to pay professional wages for professional work. There exist a variety of ways to contribute to OSS projects, but nobody’s landlord accepts pull requests as currency, so I prefer contributing with money.
Try it sometime — it’s fun and easy. I wrote two emails explaining that I wanted to buy a commercial license and that my only requirement from them was an invoice which said “commercial license for $SOFTWARE” and a figure on it. After collecting the invoices, I had my bank send checks/wire transfers as appropriate. Payments between parties in the global wealthy class (i.e. most software companies and developers) is a solved problem — don’t let whole minutes of hassle scare you away from trying it.

Have to say, that I think you missed a big trick here. We create Import Templates for users – its a CSV file that the client must download and it has the columns in the right order, etc. Clients then massage their own data into our template. To my knowledge not one single client has ever complained about this because it takes them all of 5 minutes to cut-and-paste their own columns of data from one spreadsheet to another.
When it comes to importing we have a template guaranteed to be in the right format (yes, we do a quick check that all columns are there and in the right order).
Thanks for sharing this. What requirements do you have about browser compatibility? I’m searching on the SheetJS for information about IE9 support.
Jon, thanks for sharing that technique, creating an Import Template! Great idea.
Zachary
Great tip. I am a one men dev shop and donate to OSS projects and yes, my accountant complains every single time. Never thought of asking them and invoice. Thanks!
> The Javascript glue code is 450 lines long. No unit tests because, frankly, I have
> no clue how one would test this in an automated fashion.
I try to write Javascript Unit tests where I can, but in the upload code I’ve written, the unit tests don’t say much about if it will work, and were way too hard to write for the benefit.
What I do for testing the particularly nasty parts is have a third deployment of the entire system (first being development and second being production). I then run a VM with Linux and Chrome in it with ChromeDriver and a script (I use python, but ruby is supported as well) to remote control Chrome for testing my software.
Firefox is equally easy to setup. IE is considerably harder, so I don’t bother.
The script is completely controlling a normally running Chrome, so even though this could run on Windows or OSX, if you run it directly on your desktop/laptop, trying to use your desktop/laptop while it is running will interfere with the tests. That is why I use a Linux VM to run it.
While you can spend endless amounts of time perfecting and taking it further, I bet that in about an hour or two, you could go from nothing to a few upload tests that you run by manually launching the script in your linux VM.
This is also how I run my actual javascript unit tests. A python script opens Chrome pointed at the index.html file for the tests, waits for the tests to finish running in the browser, then scrapes the table of results to report what the failures are. This gives me something that can be put in a git commit hook (It takes a minute or two, so it is a post commit report, not a condition of the commit finishing).
Great article!
One quick question: I had ruled out redis for HIPAA-sensitive data because I didn’t think it supported encryption out of the box (their docs say “Redis does not support encryption” in the Data Encryption section of http://redis.io/topics/security).
So did you find another way to encrypt data stored in Redis? (And if so, why wouldn’t that have worked on the file system?)
Thanks!
Hiya Jeff,
We use ecrypt-fs, which encrypts the redis data file (dump.rdb) when it is at rest. Access to Redis only happens over an encrypted link (tunneled over SSH), so that has us in the clear when it is in flight. My read of the HIPAA regulations and my legal advisors’ opinion is that this has us covered — the data is theoretically unencrypted in memory in the Redis process, but if an attacker ever gets arbitrary memory read on our DB server, we’re screwed anyhow.
We could use ecypt-fs for parts of the file system, but I’m trying to keep the number of things we have to track to the minimal number possible.
Got it – thanks for the update Patrick, that makes a lot of sense.
Patrick,
great article. It’s something I have to do for my CRM SaaS as well. Right now I concierge my user (yes singular) as well. Everytime he wants to upload new data he happily sends me a CSV and I import it.
Gotta write some software that does it for me. :D
Awesome write up Patrick. I’d been meaning to write something similar with my trials and travails with user CSV uploads. We’ve had user-facing CSV uploads since shortly after MVP and it’s been unbelievably complicated at times. One hard won lesson was, for the love of god, don’t try to do this server side with Ruby CSV.
Right now we use PapaParse (papaparse.com) to convert CSV to JSON client-side and then clean and upload the data row by row in a Resque job, emailing back a CSV of any invalid rows with explanations.
Love to hear how it goes when you roll out to users and if the extra fancy SheetJS UI makes it usable for them (and less work for you).
Excellent kicker on funding OSS as well.
Blah
a
ه꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲҉꙰꙱҈̿꙲
̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿ ̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿̿