Best practices for managing websocket/http server lifecycle

Hello folks!

I’m experimenting with HTTP.jl, and I’m trying to put in place a watchdog task that regularly checks on the server and spins up a new one if it goes down.

The issue I’m encountering is that I get this error: Base.IOError("listen: address already in use (EADDRINUSE)", -48), even though all the servers have been terminated - I’m saving their tasks in an array and can inspect their status.

The only way to make it work is to restart the Julia process (I’m testing this in Pluto btw).

This is my implementation:

Watchdog function:

function start_ws_watchdog()

## on first run
	
	# terminate any running watchdog(s) - by InterruptException
	terminate_active_watchdog!()

	# terminate any active websockets
	terminate_active_ws!()
	
	sleep(1)

	# start new watchdog
	watchdog_task = @spawn try 

			#ws_array |> empty!

			while true
				try
				
					# start websocket server if there's none or they're all done/closed
					if isempty(ws_array) || all(ws -> istaskdone(ws.task), ws_array) || all(ws -> !isopen(ws.listener.server), ws_array)
						start_ws_server()				
					end

				catch e
					
					if isa(e, Base.IOError) && occursin("EADDRINUSE", e.msg)
						# port already in use, re-start was too early
						@warn "WARNING - ws start, port already in use $e"
						sleep(1)	
						terminate_active_ws!()
					else
						rethrow(e)
					end
				end
				
				sleep(5) 
			end
				
		catch e
			if isa(e, InterruptException)
	            @warn "LOG - Websocket watchdog terminated"
	        else
				@error "ERROR - ws watchdog, $e"
				rethrow(e)
	        end
		end

	@info "LOG - New watchdog started: $watchdog_task"
    
    # save watchdog task reference
    push!(ws_watchdog, watchdog_task)

end

WS server:

function start_ws_server()

	# start new server (it spawns its own task)
    ws_server = WebSockets.listen!(ws_ip, ws_port; verbose = true) do ws

		@info "LOG - New Websocket server started: $(ws_server.task)"

		for msg in ws

			@spawn try

				# save received messages as-is
				lock(msg_lock) do
					push!(ws_msg_log_raw, msg)
				end

				parsed_msg = JSON3.read(msg, Dict) |> dict_keys_to_sym
				
				# save parsed message
				parsed_msg[:type] = "received"
				lock(msg_lock) do
					push!(ws_msg_log, parsed_msg)
				end

				# pass msg on
				msg_handler(ws, parsed_msg)
			
			catch e
				@error "ERROR - Message handler error, $e"
			end
		end
		
    end

	# saves server handler & task for reference
    push!(ws_array, ws_server)
	
end

And these are the killer functions:

function terminate_active_ws!()

	for ws in filter(ws -> !istaskdone(ws.task) || !isempty(ws.connections) || isopen(ws.listener.server), ws_array)
		_num_conn = length(ws.connections)
		@warn "WARNING - Terminating ws server: $(ws.task) - $_num_conn connections"
		HTTP.forceclose(ws)
	end
end
function terminate_active_watchdog!()

	for task in ws_watchdog
		if !istaskdone(task)
   			schedule(task, InterruptException(), error=true)
			@warn "WARNING - Terminating ws watchdog: $task"
		end
	end
end

I don’t understand what I’m doing wrong.
It works fine for a while (< 1 hr?), but then becomes unresponsive - even though the server still has a running task and is open - and finally gives this error when I try to trigger a server restart through the watchdog.

Anyone encountered similar issues?


Just out of curiosity, is your main intention to eventually use this code outside of Pluto? …or are you trying to set up a websocket server primarily for use inside Pluto?

I’m mainly asking, because trying to do this inside Pluto can complicate the situation due to how Pluto wants to control order-of-execution. (Playing with long-lived tasks in Pluto is tricky in general.) Depending on how you answer the question above, the solution may change a bit.

I’ve been doing a bit of websocket work myself lately, and I feel like I could come up with a solution in a non-Pluto context, but inside Pluto would be challenging.


Oh yeah, this is meant to live as a script inside Docker - I use Pluto out of convenience for development and quick inspection.

Wasn’t thinking that Pluto could be messing with it, thanks for the heads-up!
I’ll test it more extensively outside of it, then :thinking:


I turned your code into a gist that people can clone and try in the REPL. I also put some instructions at the bottom in a comment.

So far, so good. I’ll let you know if I get any EADDRINUSE issues. If I do, that would imply to me that the websocket server that was supposed to be killed didn’t die. We’ll see if it happens.
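In the meantime, one hedged guess: HTTP.forceclose returns before the server task has fully unwound, so an immediate rebind can race the socket actually being released. The failure mode itself is easy to reproduce with just the Sockets stdlib (port picked by listenany, nothing HTTP.jl-specific):

```julia
using Sockets

# Reproduce the failure mode outside HTTP.jl: binding a port that is
# still held by a live listener throws IOError(EADDRINUSE).
port, srv = listenany(ip"127.0.0.1", 8700)  # grab any free port >= 8700

err = try
    listen(ip"127.0.0.1", port)  # srv still holds the port -> throws
catch e
    e
end

println(err)  # an IOError with "EADDRINUSE" in the message

close(srv)  # only once the listener is really closed can the port be rebound
```

So after forceclose, waiting on the server task (e.g. wait(ws.task)) before calling start_ws_server() again, rather than a fixed sleep(1), should rule that race out.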


Hi, thanks for the gist. I am wondering, what do you think about Visor.jl? Would it be applicable in this case?


Visor.jl looks like something worth looking into. I think it could be used here.


To be honest, I have used it before in a similar scenario; however, for some reason, I lost all my previous code and notes on this topic. I guess the question is: could it be more desirable than the currently proposed solution?


So, I kept investigating the issue, and this is my understanding so far.
Some context first:

  • the code above is part of a personal project
  • the call to the msg_handler function is part of a chain of about a dozen functions, each using setters/getters on a global dictionary
  • I’m using re-entrant locks with the try/catch/finally pattern
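For context, the try/finally lock pattern I mean looks roughly like this (hypothetical names, not the actual project code):

```julia
# Minimal sketch of the re-entrant lock + try/finally pattern.
# state_lock and app_state stand in for the project's globals.
const state_lock = ReentrantLock()
const app_state = Dict{Symbol,Any}()

function set_state!(key::Symbol, value)
    lock(state_lock)
    try
        app_state[key] = value
    finally
        # the lock is always released, even if the body throws
        unlock(state_lock)
    end
end

function get_state(key::Symbol, default=nothing)
    lock(state_lock)
    try
        return get(app_state, key, default)
    finally
        unlock(state_lock)
    end
end
```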

What happened:

  • one of the msg_handler tasks didn’t terminate correctly and kept holding the lock
  • this blocked any following msg_handler tasks, so the whole service could only receive but never respond

Now I’m trying to nail down in which function downstream this happens and how.

As for Visor: thanks for sharing, I didn’t know about it!

(@g-gundam thanks for the gist btw)


You are welcome; all credit goes to @attdona, who wrote the package, and to @g-gundam for facilitating this discussion. By the way, I have to admit that the Visor documentation related to websockets could be a bit more explicit. Don’t you think so?

The Visor package provides tools for designing a task supervision tree. However, it is not tied to or directly related to the WebSocket protocol.

For tasks more relevant to WebSockets, you might want to explore the Rembus.jl package. Rembus builds on Visor to implement middleware for Remote Procedure Calls (RPC) and Publish/Subscribe (Pub/Sub) communications. While I don’t know if Rembus suits your specific case, it can serve as a “reference implementation” of a supervised WebSocket server.

To dive deeper, a good starting point with Rembus would be the serve_ws task, which implements the WebSocket server. Its implementation is available here:

With a basic understanding of the Visor API, I hope this helps illustrate the potential of Visor (and also of Rembus) and its use cases.


Ah, so it was Rembus. I lost all my notes related to this project. Thank you very much for providing this additional information.

I think Stefan’s advice at the very end of his post is good, because it eliminates locks altogether.
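For the record, the lock-free shape is roughly: handler tasks only put! onto a Channel, and a single consumer task owns the state, so nothing ever blocks on a shared lock. A minimal sketch with assumed names (not my actual project code):

```julia
# Single-consumer queue: handlers enqueue, one task owns the state.
const msg_queue = Channel{Any}(100)  # bounded; put! backpressures when full
const msg_log = Any[]                # only ever touched by the consumer task

# the one task allowed to mutate msg_log; exits when the channel is closed
start_consumer() = @async for msg in msg_queue
    push!(msg_log, msg)
    # ... dispatch to msg_handler(msg) here ...
end

# a websocket handler then just does: put!(msg_queue, parsed_msg)
```

If a handler dies mid-message, it can no longer leave shared state locked - the worst case is a dropped message.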

@attdona appreciate you elaborating and suggesting.

@g-gundam yep, I’m leaning towards it!
