Transliterate.jl, IsURL.jl – useful tools

This thread has devolved a little bit, but here’s some technical feedback about IsURL.jl @anon37204545.

IsURL.jl currently uses the following regexes to check for “valid” URLs:

windowsregex = r"^[a-zA-Z]:[\\]"
urlregex = r"^[a-zA-Z][a-zA-Z\d+\-.]*:"

Neither of these are correct. Let’s take a look why.

First, windowsregex to check for valid “windows paths”. Setting aside the fact that julia already has ispath support by default (which works in a platform agnostic way and actually checks filesystem properties), the given regex allows a lot of strings as paths that can simply never be. Take the following:

C:\\???\a\folder

This matches the regex, but fails to meet the microsoft specification on paths - ? are a forbidden character and thus the given string cannot be a path. Moreover, on windows, there are a bunch of forbidden names (AUX, PRN, CON, NUL, to name a few) as well as forbidden characters, which would all be allowed with the given regex.

Ok, let’s pretend we don’t care about (proper) windows path support, people will only plug in correct paths anyway, right? Well, maybe, but drive letters aren’t the only way a volumne can be identified. In fact, there are at least three ways, and none of the other ways are matching the regex.

“Normal” windows paths have a length limitation of 260 characters, so having a path longer than that directly starting with C:\ is not allowed. Moreover, to overcome this limitation (up to about 32 thousand characters), the path could be written with the Unicode supported path format, or \\?\C:\, but this again doesn’t match the regex.


Enough about filepaths, let’s look at the urlregex. URLs are a long-standing accumulation of cruft and legacy, but they at least have a proper format and RFC we can check against. In my day to day work, I often connect to servers directly via their IP, so allowing URLs to start with numbers is critically important to me. Your regex doesn’t allow that, but even the (admittedly not perfect) approach from HTTP.jl is more correct:

julia> URL("172.45.12.3/index.html")
HTTP.URI("172.45.12.3/index.html")

Yay, I can connect to my servers! You might like HTTP.jl, it’s a very nice package.

It looks like your “urlregex” doesn’t check for proper URLs at all, but simply matches the scheme part of a URL (and a trailing :, though that alone doesn’t match the RFC for paths). I wouldn’t rely on this to validate my URLs.


All in all, please don’t be discouraged to create new packages! But please also be aware that there might be existing packages that already handle these kinds of seemingly simple tasks very well. There’s no shame in asking here or in the julialang slack about specific functionality and modules, there are a lot of folks there willing to help new people out.

11 Likes

The OP literally asked for feedback if this was a good idea to port a bunch of mini packages to Julia.

1 Like

no offense and just in a 20% seriousness level, this reminds me of is-odd - npm

1 Like

I am not sure why you think this.

In any case, existing Julia code in Base, the standard libraries, and numerous packages has a lot of examples that seem to contradict this.

2 Likes

Keyword arguments do not participate in dispatch. You may find the whole chapter

https://docs.julialang.org/en/v1/manual/methods

useful, but especially the design patterns.

1 Like

I would argue “a bunch” and “more” are within rounding error of each other.

The waiting period is very useful. If a 3 day waiting period is bottlenecking you it is likely that you are registering an excessive amount of packages. (Note that you can use a unregistered package by just adding its URL so there is no need to wait for it to be registered to start using it, unless you want to register another package with that as a dependency).

7 Likes

Creating a URL object from HTTP.jl does more validation than the regex you provided. I should have clarified that. It does not start a HTTP server or make a request to that URL or any of the other things that require IO. The linked issue is solely about if you actually try to request a malicious URL. Your package has the exact same problem as the linked issue for HTTP.jl, it does no validation on the characters in the string and simply assumes if the string starts with something that looks like a scheme, that it’s a valid URL, which is a faulty assumption to make.

Additionally, URIParser does the right thing: it actually validates (and parses!) the URL. If it doesn’t contain a hier-part, even if it has something that looks like a scheme at the front, it’s not a URL. How this is not good enough for you is beyond me - parsing is fast:

julia> using BenchmarkTools

julia> @benchmark URI(a) setup=(a="https://127.0.0.1/index.html")
BenchmarkTools.Trial:
  memory estimate:  256 bytes
  allocs estimate:  7
  --------------
  minimum time:     1.410 μs (0.00% GC)
  median time:      1.450 μs (0.00% GC)
  mean time:        1.757 μs (0.00% GC)
  maximum time:     29.770 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

In what world is a scale of microseconds for parsing too slow?

I’m sorry, but how often do you recompile your webserver? This is a one-time cost, which is alleviated by using tools such as PackageCompiler.jl or Revise.jl for interactive use. This is not fundamentally different from writing your webserver in Rust or C/C++ and compiling it into a binary. Just because Franklin.jl has that behaviour out of the box doesn’t mean you can’t compile a webserver/site and run that.

Ok, that example was mainly from my day to day work and if you add a protocol the scheme is happy. However, if you add spaces or multiple @, not so much: https://user@127.0.0.1 @example.com/index.html clearly isn’t a valid URL as it contains both a space and two @, which is forbidden by the RFC. Yet your regex still matches and gives the impression that it is a valid URL. After all, the function is called isURL and the package is called IsUrl.jl. Admittedly, that example is contrived. How about this one: !occursin(urlregex, "🤔:"). This is the equivalent of calling isrelativeurl on "🤔:", but the string clearly is neither an absolute URL nor a relative one. Only a scheme is not valid for absolute URLs, so it can’t be absolute. But relative URLs (or any URL for that matter) can’t contain emoji at all, yet that call is returning true!

Incidentally, URIParser.jl again does the correct thing and tells you what’s wrong:

julia> URI("https://user@127.0.0.1 @example.com/index.html")
ERROR: Unexpected character   in server
Stacktrace: [...]

julia> URI("🤔:")
ERROR: Non-ASCII characters not supported in URIs. Encode the URL and try again.
Stacktrace: [...]

How, other than looking into your source and reading that it only checks if the given string is maybe an absolute path, am I as a user supposed to know that what you return isn’t true? There’s multiple assumptions at play here that aren’t addressed, neither by your package, nor HTTP.jl, nor the node package you linked in the repo: People input invalid things. The RFC acknowledges this:

The following line is the regular expression for breaking-down a
well-formed URI reference into its components.

“Well-formed” is tantamount to “playing nice and following everything, including the allowed alphabet”. The regex linked in that section breaks as soon as you don’t play by the rules, which is why it is so dangerous to be relying on “validation”. If you want to seriously validate a URL, buckle up, because your best bet is to implement the grammar I linked to earlier, parse the string into a well-formed object, extract only the part you care about after you’ve parsed it successfully and validated that it is indeed well-formed and then do what you must with it. Or you can simply use an already existing package which does the right thing - your call. There is no shortcut here, everything else can lead to misuse and problems down the road.

Parse, don’t validate. URL-parsing is broken (or here in video form). Let’s fix this.

11 Likes

Wait. At the risk of derailing, this is disingenuous. You solicited input in your opening question:

I am wondering do those packages match the use cases of Julians, since this community doesn’t have the same needs as the Node.js community. If the answer is positive, I will port some more.

You therefore received feedback, as the Julia community is, by and large, responsive to new users’ requests for feedback. It might not have been the feedback you were expecting, but to then come back and say “you shouldn’t have said anything at all” is changing the nature of your original request.

3 Likes

It is also great to be just civil to people who addressed you in a polite manner, even if they disagreed with you.

(Incidentally, I am wondering if this topic is setting some kind of new world record for meta discussion/LOC for a package.)

10 Likes

I deliberately left out the kwargs because I assumed it would be obvious. Here’s how I would write your package:

function transliterate(str, langs::AbstractVector, custom_replacements::AbstractDict)
    # main code goes here
end

transliterate(str; custom_replacements=Dict()) = transliterate(str, ["la"], custom_replacements)
transliterate(str, lang::AbstractString; custom_replacements=Dict()) = transliterate(str, [lang], custom_replacements) 
transliterate(str, langs; custom_replacements=Dict()) = transliterate(str, langs, custom_replacements)

There are variants to this that put custom_replacements as a kwarg on the main function, but my preference is explicitness. Also, there are other issues with your current code that someone here with the time and inclination could step you through.

3 Likes

Just my two cents: Not a big of adding unnecessary dependencies for a few lines of code. It isn’t very “Julian”. The OP may ofcourse publish your package, but if I do need to check whether something isURL, I’d just Google the code or copy/paste the package code if permitted.

If your code can exist in a larger package, like HTTP.jl for example, it may be worth it to submit a PR there to either enhance their code or add new functionality.

I would say that constructive discussion of NPM-like package ecosystems and their myriad tradeoffs is very germaine to this conversation. That said, this topic has long passed that point and would be better moved to a separate, neutral space disassociated with anyone’s work.

@Mason made a great point about large group responses/“dogpiling” (hate the term, but I can’t think of a less charged one). Given the large quantity of uninformed skepticism that Julia receives, it’s understandable that the community feels somewhat embattled. However, the resulting deluge of replies posts like this receive (most all of which are factual and not aggressive, I should add) has a tendency to draw the battle lines too early on all sides and effectively railroad any discussion into a heated back-and-forth.

More specifically, “we do things this way in $LANG/$ECOSYSTEM” and related statements are great for introducing idiomatic patterns and conventions but need to be eased into. For example, imagine @Sukera’s post about URIParser.jl and edge cases in regex-based URL validation was the first reply in this thread. It would be much easier to then point out that said package is only ~2k LOC (perfectly acceptable by NPM standards, I might add!) and should not be considered a slow or heavyweight dependency. The discourse could then have pivoted to package scope and how related functionality is usually grouped into the same package. Perhaps IsURL.jl makes more sense as a “fast path” preflight check in URIParser.jl that allows the latter to bail early if it fails the regex check?* This is obviously a very contrived and idealistic course of events, but I’d argue it’s something to strive for.

* I’ve written too much already, but this is a personal pet peeve with NPM. Many micro-packages do not compose well because of differing assumptions about interface design/less thought about extensibility. There are many technical and political reasons for this, but suffice it to say that the lack of collaboration does not a good package ecosystem make :slight_smile:

6 Likes

While generally a good idea, it’s not much use in this case. Having a bit of text followed by : does not make a valid scheme as part of an URL, simply because the scheme doesn’t exist at all without the accompanying rest. Without the rest, it’s just some text suffixed with :. That was the whole point I was trying to make: without parsing the URL into structured data, we can never be sure that the “simple regex” won’t fail at some point. This is the core of the whole “parsing vs validaton” debate. As a developer, I’d rather be able to know precisely why some input didn’t meet expectations and where during parsing it all went wrong, than having to debug and single step through an opaque regex evaluation (which is called out to a library here anyway and is thus not steppable through standard means without whipping out gdb. good luck with that).

For me, this truly isn’t about how many LOC this package has over another (the compile times and parse times are negligible in either case), but that this package aims to provide essential infrastructure code for web development and falls oh so very far of providing a trusted foundation you can rely on.

3 Likes

While I agree with you that this seem silly. Apparently it’s getting a half a million downloads a week…so somebody is using it rather than doing the check themselves. Although maybe that’s just because it got bundled into a bigger more popular package.

I find it interesting that it has a dependency…is-number. Following the links to the Author he has 837 repositories in GitHub…so I guess he excels at creating small packages.

As for the the question of how small a package can be…my answer would be it depends. If I’m often grabbing the same 10 packages from the same author to do something, then it would help unclutter the namespace (and make it easier to install) if they were all bundled into a single package.

So if an author is going to create a bunch of different packages all to do or help with X, then my feeling would be that they should create 1 package X with all the functionality. If the package only has 1 function and there are no plans to do anything else in that area…then all you can do is a simple package.

1 Like

I am very surprised too.

Having similar functions in a package is certainly a good decision. However, from a performance point of view, as long as we are not able to load a subset of a package, or have multiple packages in one repository, having small packages will have superior performance.

The only con of this is having another deps entry, which is a very minor thing, especially by considering how easily Pkg is able to handle the deps.

In Node packages, one can do this sort of import:

import IsURL: isurl

and the tree shaking tools like Rollup, Webpack, etc, will only import that in the package. This significantly improves the loading time performance.

In Julia, I have not read anywhere saying that there is a difference between import and using, in terms of loading performance.

Having small packages removes the need for using tree shaking tools.


Another aspect of this topic is the functions that don’t belong to any other package with similar functionality, and from a modularity perspective, it is always a better decision to move the generic parts of a package to another module, so everyone can use them.
Here I am very surprised by that comment that was saying some code copying is good in Julia!

Take this simple julia_versionnumber function for example. This fetches the nightly version of Julia from the repository. My feeling says that there is no better place other than Julia repository itself (in version.jl) that this should go.

However, there is always some inertia for adding functions to the Base, even if they are working exactly with the Julia repository! and now, I have no other options to move this to a repository of itself and call it JuliaVersionNumber or something.

3 Likes

The browser doing normalization before calling a URL is not the same as having a valid URL. You wrote yourself that 127.0.0.1/index.html is not a valid URL, yet entering it into a browser will load correctly, provided that a webserver is actually running on localhost and serving that document. This is because the browser, as my user-agent, tries to guess what I wanted to do and adds the protocol - something the RFC itself suggests as viable. If I enter https://🤔 into a browser, what the browser actually tries to load is https://xn--wp9h/ - a normalized string, since non-ASCII URLs are not valid and prone to be confused with other strings. That’s the whole purpose of the other package you created in this thread, Transliterate.jl (albeit targeted at languages and not emoji, though the idea is the same).

Further, HTTP.open is a different beast than constructing a URL or checking if a URL is valid and more akin to entering the text into the browser. Normalize the input string, parse it into a URL and try to access the result? That’s fine and wanted for something like HTTP.open. However, saying the same thing when it comes to a library function that should tell you if an arbitrary string is a valid URL doesn’t make sense, since the string is - by definition - arbitrary and not necessarily a URL. An input string that can’t be a URL because it contains emoji/multibyte sequences should never make this function return true, even if the normalized version of the same string is valid.

1 Like

It is not a valid absolute URL, which is what isurl() returns.

That’s not the use case. It is usually used before sending a HTTP request to the given URL, so that you don’t have to waste the bandwidth. I don’t claim IsURL is superior in that regard, and after all, it is intended to be simple. I may decide to change the approach by which URLs are validated, but library’s use doesn’t change.

Here are some comparisions regarding what you mentioned previously:

julia> @benchmark isurl("http://xn--g2aaa.xn--90a3ac/")
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     100.426 ns (0.00% GC)
  median time:      106.529 ns (0.00% GC)
  mean time:        107.478 ns (0.00% GC)
  maximum time:     216.771 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     943
julia> @benchmark URI("http://xn--g2aaa.xn--90a3ac/")
BenchmarkTools.Trial: 
  memory estimate:  400 bytes
  allocs estimate:  9
  --------------
  minimum time:     1.347 μs (0.00% GC)
  median time:      1.403 μs (0.00% GC)
  mean time:        1.415 μs (0.00% GC)
  maximum time:     2.991 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

For more complex links like https://user%40127.0.0.1%20@example.com/index.html, the difference is >20 times.

A check which is 10-15 times faster doesn’t seem like much when we are speaking microseconds. However, those microseconds get multiplied by 100 links, to be served to 1000 users… which gives you all the calculations you need.

That’s just one of the uses (see slugify - npm). I intended to use it for writing actually. Of course, if you are creative enough, you can combine any two packages and make use of them. :smile:

Right, in that regard https://🤔/ also isn’t a valid absolute URL, because it’s not a URL at all because of the emoji.

Well, now we’re at the point of comparing a function that doesn’t do what it claims to do (no validation if the given string is even a URL) to a full fledged, correct implementation. This hardly seems like a useful comparison, especially given that the derivative function isrelativeurl claims even the nonsensical __http://http://http:// to be a valid relative URL…


In any case, this discussion seems to have run its course. You asked for feedback, you were given feedback. Even though you don’t seem to particularly like what was given, regardless of merit.

2 Likes