Hot Reloading Code in Erlang: How Does It Work?
Engine schematic for the Paraguay Railway. I took this back in December 2024. Hot reloading is a bit like swapping out the engine while the train is moving.
A while back, I fixed a prod outage by hot reloading code in the BEAM VM, and earlier today I caught a random heisenbug using it [1]. This post covers how Erlang hot reloading works, with demos on migrating state and hot reloading TCP servers without dropping connections.
1. How does Erlang store code?
One of the core building blocks in Erlang/OTP is gen_server. This is effectively a single "actor", if you're familiar with the actor model. A single gen_server holds some stateful information in a variable called State [2]. A gen_server communicates with other actors through messages: each process has a mailbox, and the gen_server dispatches on the contents of each message, either replying to the sender (a call) or handling it without a reply (a cast).
The Erlang Runtime System (ERTS) uses a code server, basically a running process that keeps track of compiled code that's been loaded into the BEAM VM. The system keeps at most two versions of a module, a "current" version and an "old" version. What this means is that if you're an actor running the "old" version of the code, your local calls (foo()) will still get routed to the "old" version, while fully qualified calls (mod:foo()) always get routed to the "current" version. Before a third version can be loaded, the old one has to be purged, and code:purge/1 kills any process still running it.
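To make the local versus fully qualified distinction concrete, here is a minimal sketch of a hand-rolled process loop (no gen_server). The module name looper and its messages are made up for illustration. The process stays in whatever version it is running until it makes a fully qualified call, which is the point where it picks up newly loaded code.

```erlang
-module(looper).
-export([start/0, loop/1]).

%% Hypothetical example module, not part of the post's repository.
start() ->
    spawn(?MODULE, loop, [0]).

loop(N) ->
    receive
        bump ->
            %% Local call: stays in the version this process is running.
            loop(N + 1);
        {get, From} ->
            From ! {count, N},
            loop(N);
        upgrade ->
            %% Fully qualified call: jumps to the "current" version.
            ?MODULE:loop(N)
    end.
```

If you load a new looper while a process sits in receive, it keeps running the old code until it gets an upgrade message. gen_server does the equivalent for you, because it invokes your callbacks with fully qualified calls.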
2. Basic state migration
You'll often want to change the information stored in State, or the structure of it. For example, consider this code:
-module(counter).
-behaviour(gen_server).
-vsn("1.0.0").

-export([start_link/0, get_count/0, increment/0]).
-export([init/1, handle_call/3, handle_cast/2, code_change/3, terminate/2, handle_info/2]).

%% -- API --
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

get_count() ->
    gen_server:call(?MODULE, get_count).

increment() ->
    gen_server:cast(?MODULE, increment).

%% -- Callbacks --
init([]) ->
    {ok, 0}.

handle_cast(increment, Count) ->
    {noreply, Count + 1}.

handle_call(get_count, _From, Count) ->
    {reply, Count, Count}.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

handle_info(_Info, State) -> {noreply, State}.

terminate(_Reason, _State) -> ok.
This gen_server stores an integer Count that is incremented by counter:increment() and returned by counter:get_count().
In the shell:
peixian@Mac ~/c/erl-hot-reloading> erl
Erlang/OTP 27 [erts-15.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]
Eshell V15.2 (press Ctrl+G to abort, type help(). for help)
1> file:copy("counter_v1.erl", "counter.erl"), c(counter).
{ok,counter}
2> counter:start_link().
{ok,<0.95.0>}
3> counter:get_count().
0
4> counter:increment().
ok
5> counter:get_count().
1
6> counter:increment().
ok
7> counter:increment().
ok
8> counter:get_count().
3
The code_change callback handles migration between versions. Right now State is a single integer. Not great if we want to track when it was last updated.
We can change our counter module to:
-module(counter).
-behaviour(gen_server).
-vsn("2.0.0").

-export([start_link/0, get_count/0, increment/0, get_info/0]).
-export([init/1, handle_call/3, handle_cast/2, code_change/3, terminate/2, handle_info/2]).

%% State record
-record(state, {
    count = 0,
    increment_count = 0,
    last_updated
}).

%% -- API --
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

get_count() ->
    gen_server:call(?MODULE, get_count).

increment() ->
    gen_server:cast(?MODULE, increment).

get_info() ->
    sys:get_state(?MODULE).

%% -- Callbacks --
init([]) ->
    {ok, #state{}}.

handle_cast(increment, State) ->
    {noreply, State#state{count = State#state.count + 1,
                          %% Count every increment, not just the latest one
                          increment_count = State#state.increment_count + 1,
                          last_updated = erlang:timestamp()
                         }}.

handle_call(get_count, _From, State) ->
    {reply, State#state.count, State}.

%% Migrate a V1 state (a bare integer) into the V2 record
code_change("1.0.0", OldState, _Extra) when is_integer(OldState) ->
    NewState = #state{
        count = OldState,
        increment_count = OldState,
        last_updated = erlang:timestamp()},
    {ok, NewState};
code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

handle_info(_Info, State) -> {noreply, State}.

terminate(_Reason, _State) -> ok.
We added counter:get_info() to return the state, and this migration clause:
code_change("1.0.0", OldState, _Extra) when is_integer(OldState) ->
    NewState = #state{
        count = OldState,
        increment_count = OldState,
        last_updated = erlang:timestamp()},
    {ok, NewState};
This clause migrates the old state. It only matches when the old version is "1.0.0" and the old state is an integer, and it builds a V2 record from that integer. The vsn tag lets us easily distinguish which version we're coming from.
In the shell:
peixian@Mac ~/c/erl-hot-reloading> erl
Erlang/OTP 27 [erts-15.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]
Eshell V15.2 (press Ctrl+G to abort, type help(). for help)
1> file:copy("counter_v1.erl", "counter.erl"), c(counter).
{ok,counter}
2> counter:start_link().
{ok,<0.95.0>}
3> counter:increment().
ok
4> counter:increment().
ok
5> counter:get_count().
2
6> file:copy("counter_v2.erl", "counter.erl"), c(counter).
{ok,counter}
7> counter:get_info().
2
8> sys:suspend(counter).
ok
9> sys:change_code(counter, counter, "1.0.0", []).
ok
10> sys:resume(counter).
ok
11> counter:get_info().
{state,2,2,{1769,908677,996152}}
12>
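Commands 1-5: load v1, increment twice, count is 2. Command 6: load v2, which gives us counter:get_info(). On 7, the state is still the bare integer 2, because loading new code does not touch the state of a running process.
The key sequence is 8, 9, and 10: we suspend the process, call sys:change_code (which runs code_change from the newly loaded module on the current state), then resume. On 11, we see the state has changed from a single integer into a state record, which prints as a tagged tuple. One caveat: between 6 and 8 the process is already using the v2 callbacks with a v1 state, so an increment arriving in that window would crash it. On a live system, suspend before loading the new code.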
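The shell steps above can be wrapped into one function. This is a rough sketch, not something from the repository: the module name hot_upgrade and the function upgrade/3 are made up, and it assumes the new .beam file is already compiled and on the code path. It suspends before loading, so no ordinary message can reach the new callbacks while the state still has the old shape.

```erlang
-module(hot_upgrade).
-export([upgrade/3]).

%% Hypothetical helper: Name is the registered name or pid of the process,
%% Module is its callback module, OldVsn is passed through to code_change/3.
upgrade(Name, Module, OldVsn) ->
    ok = sys:suspend(Name),
    try
        %% Drop any leftover old version. Note that purging kills
        %% processes that are still running that old version.
        _ = code:purge(Module),
        %% Load Module.beam from the code path (assumed already compiled).
        {module, Module} = code:load_file(Module),
        %% Runs Module:code_change(OldVsn, State, []) in the suspended process.
        ok = sys:change_code(Name, Module, OldVsn, [])
    after
        %% Always resume, even if loading or migration failed.
        ok = sys:resume(Name)
    end.
```

With the counter example, hot_upgrade:upgrade(counter, counter, "1.0.0") would stand in for commands 6 and 8 through 10, provided counter.beam on disk is already the v2 build. If the migration fails, the process is resumed with its old state and the call crashes with a badmatch, so you find out immediately.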
3. Okay, now do it with TCP
Hot reloading is useful because it's fast and preserves user state. Let's see how this works with TCP.
In the demo, V1 is an echo server and V2 upcases everything. We hot reload from one to the other without dropping the netcat connection.
Erlang's TCP API is gen_tcp. You can read more about the specifics, but the short version is that gen_tcp gives us listen and accept, and in this demo a short-lived listener process accepts one connection and hands the socket off to a gen_server with gen_tcp:controlling_process/2.
V1:
-module(tcp_demo).
-behaviour(gen_server).
-vsn("1.0.0").

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, code_change/3, terminate/2]).

-record(state, {socket}).

%% -- API --
start_link() ->
    spawn(fun() -> start_listener() end).

start_listener() ->
    {ok, Listen} = gen_tcp:listen(9000, [binary, {packet, 0}, {active, true}, {reuseaddr, true}]),
    io:format("Listening on 9000...~n"),
    {ok, Socket} = gen_tcp:accept(Listen),
    gen_tcp:close(Listen),
    %% Start the gen_server (using start, not start_link, so it's independent)
    {ok, Pid} = gen_server:start({local, ?MODULE}, ?MODULE, Socket, []),
    ok = gen_tcp:controlling_process(Socket, Pid),
    io:format("Socket transferred to PID ~p. Listener dying.~n", [Pid]).

%% -- Callbacks --
init(Socket) ->
    io:format("V1 (Echo) Started.~n"),
    {ok, #state{socket = Socket}}.

handle_info({tcp, Socket, Data}, State) ->
    gen_tcp:send(Socket, Data), %% Echo back whatever the user sent
    {noreply, State};
handle_info({tcp_closed, _S}, State) ->
    io:format("Client disconnected~n"),
    {stop, normal, State}.

handle_call(_Req, _From, State) -> {reply, ok, State}.

handle_cast(_Msg, State) -> {noreply, State}.

code_change(_OldVsn, State, _Extra) -> {ok, State}.

terminate(_Reason, _State) -> ok.
Run this and connect with nc localhost 9000. It echoes back whatever you type.
V2:
-module(tcp_demo).
-behaviour(gen_server).
-vsn("2.0.0").

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, code_change/3, terminate/2]).

-record(state, {socket, buffer = <<>>}).

%% -- API --
start_link() ->
    spawn(fun() -> start_listener() end).

start_listener() ->
    {ok, Listen} = gen_tcp:listen(9000, [binary, {packet, 0}, {active, true}, {reuseaddr, true}]),
    io:format("V2 Listener on 9000...~n"),
    {ok, Socket} = gen_tcp:accept(Listen),
    gen_tcp:close(Listen),
    {ok, Pid} = gen_server:start({local, ?MODULE}, ?MODULE, Socket, []),
    ok = gen_tcp:controlling_process(Socket, Pid),
    io:format("Socket transferred to PID ~p. Listener dying.~n", [Pid]).

%% -- Callbacks --
init(Socket) ->
    io:format("V2 (Shouting) Started.~n"),
    {ok, #state{socket = Socket, buffer = <<>>}}.

%% V2 Logic: Buffering & Shouting
handle_info({tcp, Socket, Data}, State = #state{buffer = Buff}) ->
    NewBuff = <<Buff/binary, Data/binary>>,
    {Lines, Rest} = split_lines(NewBuff, []),
    %% Process all complete lines found
    [gen_tcp:send(Socket, <<(string:uppercase(Line))/binary, "\n">>) || Line <- Lines],
    {noreply, State#state{buffer = Rest}};
handle_info({tcp_closed, _S}, State) ->
    io:format("Client disconnected~n"),
    {stop, normal, State}.

%% -- Migration Logic --
%% Upgrade from V1 (Socket only) -> V2 (Socket + Buffer)
code_change("1.0.0", {state, Socket}, _Extra) ->
    io:format("Migrating State: V1 -> V2 (Adding buffer)~n"),
    {ok, #state{socket = Socket, buffer = <<>>}};
code_change(_OldVsn, State, _Extra) -> {ok, State}.

%% -- Helpers --
%% Recursively split binary into lines
split_lines(Bin, Acc) ->
    case binary:split(Bin, <<"\n">>) of
        [Line, Rest] -> split_lines(Rest, [Line | Acc]);
        [Last] -> {lists:reverse(Acc), Last}
    end.

handle_call(_Req, _From, State) -> {reply, ok, State}.

handle_cast(_Msg, State) -> {noreply, State}.

terminate(_Reason, _State) -> ok.
The V2 code adds a buffer field to the state. Why buffer? TCP is a stream protocol, so it doesn't preserve message boundaries. A single {tcp, Socket, Data} message might carry half a line, two lines, or one and a half lines. Without buffering, you'd uppercase partial lines or chop messages in weird places. The buffer accumulates bytes until we hit a newline, then we process each complete line and keep the remainder for next time.
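To see the buffering on its own, here is a small sketch that pulls the same split_lines logic into a standalone module. The module name line_buffer and the function feed/2 are made up for illustration. It takes the leftover bytes from last time plus a new chunk, and returns the complete lines along with the new leftover.

```erlang
-module(line_buffer).
-export([feed/2]).

%% Hypothetical wrapper around the post's split_lines/2.
%% Append Chunk to Buffer, return {CompleteLines, Leftover}.
feed(Buffer, Chunk) ->
    split_lines(<<Buffer/binary, Chunk/binary>>, []).

%% Peel off one line per newline; whatever follows the last
%% newline is incomplete and becomes the new buffer.
split_lines(Bin, Acc) ->
    case binary:split(Bin, <<"\n">>) of
        [Line, Rest] -> split_lines(Rest, [Line | Acc]);
        [Last] -> {lists:reverse(Acc), Last}
    end.
```

Feeding it "hel", then "lo\nwor", then "ld\nbye\n" shows the three cases: the first chunk yields no lines and a buffer of "hel", the second yields the line "hello" with "wor" left over, and the third yields "world" and "bye" with an empty buffer.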
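The code_change clause migrates the V1 state (socket only) to V2 (socket + empty buffer). The reload itself is the same sequence as for the counter: suspend tcp_demo, load V2, call sys:change_code(tcp_demo, tcp_demo, "1.0.0", []), then resume. While the process is suspended, incoming {tcp, Socket, Data} messages wait in its mailbox, so any data in flight gets handled by the V2 callbacks after the reload.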
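One detail worth spelling out is why that clause matches the literal tuple {state, Socket} instead of a record pattern. At runtime a record is just a tuple tagged with the record name, and once V2 is compiled, #state{} means a three-element tuple, so a #state{} pattern can no longer match a V1 state. A minimal sketch, where the module record_shape and its functions are made up for illustration:

```erlang
-module(record_shape).
-export([v1_state/1, migrate/1]).

%% The V2 record definition: two fields, so a 3-tuple at runtime.
-record(state, {socket, buffer = <<>>}).

%% What a V1 state (one field) looks like at runtime: a 2-tuple.
v1_state(Socket) ->
    {state, Socket}.

%% Old shape: match the raw tuple and build a V2 record from it.
migrate({state, Socket}) ->
    #state{socket = Socket};
%% Already the V2 shape: pass through unchanged.
migrate(#state{} = State) ->
    State.
```

Here migrate(v1_state(Sock)) gives back {state, Sock, <<>>}, which is the V2 record with the default empty buffer.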
4. This sounds like a maintenance nightmare, why would you do this to yourself?
For most situations you probably don't want to be doing this. However, there are a few cases where it's been invaluable:
- Heisenbugs that are state dependent, where you're not sure you'll be able to reproduce them after a reconnect or service restart. As much as we want our systems to be stateless, many of them aren't.
- Very quick remediations. I've seen this twice: hot code reloading effectively means you can parallel ssh into all machines and hot reload a patch, avoiding the need for an hours-long deploy across tens of thousands of machines.
You can find all the code for this at https://git.sr.ht/~peixian/erlang-hot-reloading/ as well.
Footnotes:
1. The BEAM book is honestly not as detailed about hot code reloading as I would prefer.
2. Erlang variables start with a capital letter. Identifiers that start with a lowercase letter are atoms, which are named constants rather than strings.


