On temporary files in Ruby
how not to shoot yourself in the foot
We often use temporary files when building software: to store a report before sending it to a customer or uploading it to AWS S3; to download a CSV to parse into a database; or to store a picture of your beloved cat before sending it to the generative AI to put a silly hat on the unsuspecting feline. Today, I will talk about common mistakes with temporary files.
Using the file after disposing of the Tempfile object
Let’s take a look at the following example:
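require "csv"
require "tempfile"

def generate_report(rows)
  tempfile = Tempfile.new("report")
  # CSV.new does not yield to a block, so write through the returned object
  csv = CSV.new(tempfile, write_headers: true, headers: ["ID", "Name"])
  rows.each { csv << [_1.id, _1.name] }
  tempfile.flush # make sure the rows are on disk before the path is used
  tempfile.path
end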
# Somewhere else
path = generate_report(users)
s3_bucket.object("report.csv").upload_file(path)
The code passes QA, your tests are green, and you ship it to production. Then, the next day, you get an error in Sentry:
No such file or directory @ rb_sysopen - /var/folders/cg/nr4lp6j52tz30sthgzsvfmq00000gn/T/report20240821-56738-8pw09p (Errno::ENOENT)
Why is this happening? Because of the lifecycle of the Tempfile object. When a normal file is created, written to, and then closed, the file persists on the disk. A temporary file, on the other hand, not only gets a unique name in a special place on the disk but is also automatically deleted (in most cases); the only question is when this happens:
- When the Tempfile object that defines finalizers is garbage collected.
- When a block form of Tempfile.create is used to open the temporary file.
- An explicit unlink removes the file from the file system. Interestingly, on POSIX systems, the file can be unlinked before closing, which removes the filesystem entry but keeps the file handle open; this ensures that the only processes accessing the file are those that already have it open.
Let’s consider the first scenario:
# Create a temporary file in a proc
path = -> { Tempfile.new.path }.call
# Start garbage collector
GC.start
# Let's try to read the file
File.read(path)
# => Errno::ENOENT
What about the block form?
path = Tempfile.create { |tempfile| tempfile.path }
File.read(path)
# => Errno::ENOENT
As expected, at the end of the block the file object is closed and the underlying file is removed.
And the most “wordy” and error-prone way to use Tempfile:
file = Tempfile.new
begin
  # ...do something with file...
ensure
  file.close
  file.unlink # delete the temporary file
end
Alright, with all this knowledge, let’s refactor the first example to no longer randomly break in production:
def generate_report(tempfile, rows)
  # CSV.new does not yield to a block, so write through the returned object
  csv = CSV.new(tempfile, write_headers: true, headers: ["ID", "Name"])
  rows.each { csv << [_1.id, _1.name] }
  tempfile.flush # flush buffered rows so readers of the path see them
end

def with_temporary_file(name, &block)
  Tempfile.create(name, &block)
end

# Somewhere else
with_temporary_file("report") do |tempfile|
  generate_report(tempfile, users)
  s3_bucket.object("report.csv").upload_file(tempfile.path)
end
With this change, we ensure that:
- The temporary file is deleted as soon as we no longer need it.
- The file stays on the file system for as long as we need it.
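To see this lifecycle end to end without an S3 bucket, here is a minimal runnable sketch. The User struct and the File.read call standing in for the upload are assumptions made purely for illustration; the sketch writes through the object returned by CSV.new because CSV.new does not yield to a block.

```ruby
require "csv"
require "tempfile"

# Hypothetical stand-in for the article's users: anything with id and name works.
User = Struct.new(:id, :name)

def generate_report(tempfile, rows)
  csv = CSV.new(tempfile, write_headers: true, headers: ["ID", "Name"])
  rows.each { |row| csv << [row.id, row.name] }
  tempfile.flush # push buffered rows to disk before anyone reads the path
end

users = [User.new(1, "Alice"), User.new(2, "Bob")]
path_seen = nil

uploaded = Tempfile.create("report") do |tempfile|
  generate_report(tempfile, users)
  path_seen = tempfile.path
  File.read(tempfile.path) # stand-in for the upload, which reads from the path
end

# Inside the block the file was readable; after the block it is gone.
file_left_behind = File.exist?(path_seen)
```

Inside the block the path is readable by anything that needs it; once the block returns, nothing is left on disk to clean up.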
Garbage collection on Tempfile objects
I mentioned that the file is almost always deleted, and that it depends on whether the finalizers were defined. How do you know if that is the case? Actually, it is pretty simple:
- Tempfile.new and Tempfile.open (which uses Tempfile.new under the hood) define finalizers that will automatically close and unlink the file.
- Tempfile.open with a block will also close the file at the end of the block.
- Tempfile.create does not define finalizers but will close and unlink the file when used in block form.
This means that Tempfile.create is the most efficient API for temporary files from a performance point of view (no delegation, no finalizers), but, when called without a block, it is also the only API that keeps the temporary file on disk even after the returned object (a plain File, not a Tempfile) is collected by GC.
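A small sketch of that behavior, assuming nothing beyond the standard library: without a block, Tempfile.create hands back a plain File, and the file on disk outlives both the close and a GC run until we remove it ourselves.

```ruby
require "tempfile"

file = Tempfile.create("demo") # no block: nothing cleans up for us
klass = file.class             # a plain File, not a Tempfile
path = file.path

file.close                     # frees the descriptor, keeps the file on disk
file = nil
GC.start                       # no finalizer is attached, so GC removes nothing

survived_gc = File.exist?(path)
File.unlink(path)              # deleting the file is our responsibility now
removed = !File.exist?(path)
```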
If you really, really need a temporary file that will not be deleted when its object is garbage collected, and which you will manually track and delete when it is no longer needed (pinky promise!), use Tempfile.create without a block. Returning to our first example:
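def generate_report(rows)
  tempfile = Tempfile.create("report")
  # CSV.new does not yield to a block, so write through the returned object
  csv = CSV.new(tempfile, write_headers: true, headers: ["ID", "Name"])
  rows.each { csv << [_1.id, _1.name] }
  tempfile.path
ensure
  # manually close to flush the data and free the file descriptor;
  # &. guards against Tempfile.create itself having raised
  tempfile&.close
end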
# Somewhere else
path = generate_report(users)
s3_bucket.object("report.csv").upload_file(path)
# It is important to not forget to delete the file
File.unlink(path)
With this change, we have to manually close the temporary file to free the file descriptor and manually unlink the file to free the disk space. That’s a lot of responsibility to carry, and I would recommend avoiding it at all costs. Even if you can do it, it does not mean you should.
Using Tempfile.open instead of Tempfile.create
Tempfile.create is the recommended interface when working with temporary files in Ruby. It avoids multiple pitfalls that exist in both Tempfile.new and Tempfile.open:
- Avoids the performance cost of delegation, which is incurred because Tempfile inherits from DelegateClass(File) and forwards calls to the underlying File.
- Does not rely on a finalizer to close and unlink the file, which can be unreliable.
- Ensures the file is unlinked at the end of the block.
The Ruby documentation says the following:
Tempfile.open is still appropriate if you need the Tempfile to be unlinked by a finalizer and you cannot explicitly know where in the program the Tempfile can be unlinked safely.
When there is no block given, Tempfile.open just proxies the call to Tempfile.new, so it’s better to use the latter directly. When called with a block, on the other hand, it ensures the file handle is closed at the end of the block, but it does not unlink the file and relies on a finalizer and GC to do that. This means we cannot know when the file will be deleted (unless GC is disabled and triggered manually): once the Tempfile object is no longer referenced, GC can delete the file at any point in time, so it is simply not safe to keep using the path!
If you plan to use a file, hold on to the Tempfile instance until you no longer need it!
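The following sketch illustrates the block form of Tempfile.open. It deliberately keeps a reference to the Tempfile (by returning it from the block) so that the finalizer cannot run in the middle of the demonstration; the variable names are only illustrative.

```ruby
require "tempfile"

# Keep a reference to the Tempfile so GC cannot finalize it yet.
tempfile = Tempfile.open("demo") do |f|
  f << "hello"
  f # return the Tempfile itself from the block
end

closed_after_block = tempfile.closed?   # the handle is closed...
path = tempfile.path
exists_after_block = File.exist?(path)  # ...but the file is still on disk

tempfile.unlink # explicit cleanup instead of waiting for GC
exists_after_unlink = File.exist?(path)
```

Without the explicit unlink, the file would linger until the Tempfile object happened to be collected.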
Re-opening the temporary file even if it is already open
When using Tempfile, Ruby not only creates a temporary file but also opens it in w+ mode (read and write). The new file is empty, so you can immediately start writing to it!
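# bad: opens a second handle to a file that is already open
Tempfile.create do |tempfile|
  CSV.open(tempfile.path, "w") do |csv|
    csv << row
  end
end
# good: reuses the handle that Tempfile.create already opened
# (CSV.new does not yield to a block, so use the returned object)
Tempfile.create do |tempfile|
  csv = CSV.new(tempfile)
  csv << row
end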
The difference is that in the first example we will have two file handles open in the process. On POSIX systems, the number of file descriptors any process can hold is limited (the limit can be configured, but it is still a limit). If we use two file descriptors for each of our temporary files in a Sidekiq process with multiple concurrent workers generating our quarterly pay stubs, the process can (and will) run out of descriptors. It will look like this:
300.times.map { Tempfile.new }
# => Too many open files @ rb_sysopen - /var/folders/cg/nr4lp6j52tz30sthgzsvfmq00000gn/T/20240822-60283-9a00gd (Errno::EMFILE)
The default soft limit for the number of open files is 256 on my macOS machine, and I simulated actual processes holding open files by storing them in an array. If we re-open the already open file and allocate a second descriptor, we will only be able to open half as many temporary files. Also, please note that descriptors are used for more than files: sockets (database and Redis connections, incoming web requests for your Puma server), log files, and even stdin, stdout, and stderr all consume them.
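If you want to check the numbers on your own machine, a sketch like the following reads the current limits and shows that re-opening a temporary file by path really does consume a second descriptor. It assumes a POSIX system where Process.getrlimit is available.

```ruby
require "tempfile"

# Soft and hard limits on open file descriptors for this process.
soft_limit, hard_limit = Process.getrlimit(Process::RLIMIT_NOFILE)

distinct_descriptors = Tempfile.create("fd-demo") do |tempfile|
  File.open(tempfile.path) do |second_handle|
    # Same file on disk, two different descriptors in this process.
    tempfile.fileno != second_handle.fileno
  end
end
```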
Not closing the temporary file after use
We just talked about it. File descriptors are a precious resource, and we should handle them with care.
# Bad: the file will not be closed and removed immediately
# after the use and relies on GC to clean it up
def generate_and_upload_report
  tempfile = Tempfile.new
  tempfile << "Hello"
  upload(tempfile)
end
This code depends on the garbage collector to both close and unlink the file. It will eventually happen, but on any system with a high load, the GC might not trigger fast enough, and you could run out of file descriptors or disk space.
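A possible fix, sketched with the block form of Tempfile.create. The upload method here is a hypothetical stand-in that simply reads the IO it is given; the real one from the example above is not shown in this article.

```ruby
require "tempfile"

# Hypothetical stand-in for the upload method: it just reads the IO.
def upload(io)
  io.read
end

# Good: the file is closed and removed as soon as the block finishes
def generate_and_upload_report
  Tempfile.create(["report-", ".txt"]) do |tempfile|
    tempfile << "Hello"
    tempfile.rewind # the stand-in reads from the start of the file
    upload(tempfile)
  end # closed and unlinked here, even if upload raises
end

result = generate_and_upload_report
```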
Not specifying a file name or extension
It is not necessary to specify any parameters when using Tempfile. The following code is perfectly valid and works amazingly well:
Tempfile.create do |f|
  f << "Hello, world"
end
The downside is that if you are dealing with a production incident, for example, Errno::ENOSPC: No space left on device, you need all the help you can get. Looking at a directory that only has files like 20240822-60283-j8dbll is not helpful at all!
Let’s fix it:
Tempfile.create(["user-report-#{Current.tenant_id}-", ".txt"]) do |f|
  f << "Hello, world"
end
Here, we passed an array of two strings, which Tempfile will use as a prefix and a suffix for the filename:
user-report-1234-20240822-60283-r7wsje.txt
It is much easier to see where the files are coming from, what their contents might be, and we can proceed with addressing the root cause of that pesky disk space problem.
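As a rough illustration of why this helps during an incident, the sketch below creates a prefixed file and then finds it again with a glob over the temporary directory. The tenant_id variable is a hypothetical stand-in for Current.tenant_id, and the sketch assumes the default temporary directory is used.

```ruby
require "tempfile"
require "tmpdir"

tenant_id = 1234 # hypothetical value standing in for Current.tenant_id

name, found = Tempfile.create(["user-report-#{tenant_id}-", ".txt"]) do |f|
  f << "Hello, world"
  f.flush

  # With a meaningful prefix, a simple glob finds this tenant's files.
  pattern = File.join(Dir.tmpdir, "user-report-#{tenant_id}-*.txt")
  basenames = Dir.glob(pattern).map { |path| File.basename(path) }

  [File.basename(f.path), basenames.include?(File.basename(f.path))]
end
```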
Wrapping up
To conclude, here is how I would recommend using Tempfile:
Tempfile.create(["report-", ".csv"]) do |tempfile|
  # CSV.new does not yield to a block, so write through the returned object
  csv = CSV.new(tempfile)
  # ...
end
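If you don’t want any other process to access your file, unlink it in the first line of the block. Tempfile.create yields a plain File, which has no unlink instance method, so call File.unlink(tempfile.path) instead of tempfile.unlink. Please note that you also will not be able to reopen the file by its path after that.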
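Here is a minimal sketch of that early-unlink pattern, assuming a POSIX system (where an open file can be unlinked). It uses File.unlink with the path because the object yielded by Tempfile.create is a plain File rather than a Tempfile.

```ruby
require "tempfile"

visible, content = Tempfile.create(["report-", ".csv"]) do |tempfile|
  path = tempfile.path
  File.unlink(path)                 # remove the directory entry right away
  visible_after_unlink = File.exist?(path)

  tempfile << "id,name\n"           # the open handle still works
  tempfile.rewind
  [visible_after_unlink, tempfile.read]
end
```

The data stays reachable only through the handle we already hold, and it disappears for good once that handle is closed at the end of the block.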