Knet on powerpc64le platform?

I am trying to use Knet on a supercomputer with powerpc64le nodes and NVIDIA Tesla V100 GPUs, but I cannot compile it (so I cannot use it at all, with or without GPU access).

It seems the installed Knet version (1.4.2) wants to download libknet8 as an artifact, but since powerpc64le is a Tier 2 platform, it is not covered by the automatic artifact builds.

I am able to compile libknet8.so with the Knet/src/libknet8/build.jl script (after adding the -std=c++11 option to CFLAGS, because gcc version 8.4.4 (Red Hat 7.6) needs it; otherwise there is an unsatisfied external _ieee).
Now I am stuck getting Julia/Knet/Pkg to use this libknet8. How should I proceed?
Should I add an Overrides.toml file in .julia/artifacts (and with which format)?
Should I modify .julia/packages/Knet/rgT4R/Artifacts.toml?
Should I go directly to the code in src?

I have tried several things, but none of them works. Perhaps someone knows better?

For the record, here is the failed compilation attempt:

[jdavid00@login01 tknet]$ julia -q --project=.
julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943* (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (powerpc64le-unknown-linux-gnu)
  CPU: unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, pwr9)
Environment:
  JULIA_EDITOR = vim

(tknet) pkg> st
Status `~/jl/tknet/Project.toml`
  [1902f260] Knet v1.4.2

julia> using Knet
[ Info: Precompiling Knet [1902f260-5fb4-5aff-8c31-6271790ab950]
ERROR: LoadError: LoadError: Cannot locate artifact 'libknet8' in '/m100/home/userexternal/jdavid00/.julia/packages/Knet/rgT4R/Artifacts.toml'
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] do_artifact_str(::String, ::Dict{String,Any}, ::String, ::Module) at /m100/home/userexternal/jdavid00/j152/julia-1.5.2/usr/share/julia/stdlib/v1.5/Pkg/src/Artifacts.jl:1023
 [3] invokelatest(::Any, ::Any, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./essentials.jl:710
 [4] invokelatest(::Any, ::Any, ::Vararg{Any,N} where N) at ./essentials.jl:709
 [5] top-level scope at /m100/home/userexternal/jdavid00/j152/julia-1.5.2/usr/share/julia/stdlib/v1.5/Pkg/src/Artifacts.jl:1068
 [6] include(::Function, ::Module, ::String) at ./Base.jl:380
 [7] include at ./Base.jl:368 [inlined]
 [8] include(::String) at /m100/home/userexternal/jdavid00/.julia/packages/Knet/rgT4R/src/Knet.jl:1
 [9] top-level scope at /m100/home/userexternal/jdavid00/.julia/packages/Knet/rgT4R/src/Knet.jl:11
 [10] include(::Function, ::Module, ::String) at ./Base.jl:380
 [11] include(::Module, ::String) at ./Base.jl:368
 [12] top-level scope at none:2
 [13] eval at ./boot.jl:331 [inlined]
 [14] eval(::Expr) at ./client.jl:467
 [15] top-level scope at ./none:3
in expression starting at /m100/home/userexternal/jdavid00/.julia/packages/Knet/rgT4R/src/libknet8/LibKnet8.jl:7
in expression starting at /m100/home/userexternal/jdavid00/.julia/packages/Knet/rgT4R/src/Knet.jl:11
ERROR: Failed to precompile Knet [1902f260-5fb4-5aff-8c31-6271790ab950] to /m100/home/userexternal/jdavid00/.julia/compiled/v1.5/Knet/f4vSz_ZtxIT.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] compilecache(::Base.PkgId, ::String) at ./loading.jl:1305
 [3] _require(::Base.PkgId) at ./loading.jl:1030
 [4] require(::Base.PkgId) at ./loading.jl:928
 [5] require(::Module, ::Symbol) at ./loading.jl:923

julia>

Not a helpful reply… it would be interesting to find out more about the supercomputer you are using, and what your application of Julia is.

A more helpful reply. An expert will be along in a minute.
You can install a package in development mode so you can poke around the build scripts and the build directory tree:

pkg> dev Knet


Knet.jl has this include:
include("libknet8/LibKnet8.jl")

LibKnet8.jl has:

const libknet8 = Libdl.find_library(["libknet8"], [artifact"libknet8"])

From the Libdl.find_library docstring: "Searches for the first library in names in the paths in the locations list, DL_LOAD_PATH, or system library paths (in that order) which can successfully be dlopen'd. On success, the return value will be one of the names (potentially prefixed by one of the paths in locations). This string can be assigned to a global const and used as the library name in future ccall's. On failure, it returns the empty string."
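A minimal sketch of pointing `find_library` at a locally built library instead of the artifact directory, assuming libknet8.so sits in a directory named by a hypothetical `LIBKNET8_DIR` environment variable (defaulting to a made-up `~/libknet8`; Knet itself does not read this variable). It illustrates the behaviour quoted above: failure is signalled by an empty string, not an exception, so that case should be checked before the result is used in `ccall`.

```julia
using Libdl

# Hypothetical directory holding a locally built libknet8.so.
# LIBKNET8_DIR is an illustrative name, not something Knet reads.
libdir = get(ENV, "LIBKNET8_DIR", joinpath(homedir(), "libknet8"))

# Search libdir first, then DL_LOAD_PATH and the system library paths.
libknet8_path = Libdl.find_library(["libknet8"], [libdir])

if isempty(libknet8_path)
    # find_library reports failure with "" rather than throwing
    @warn "libknet8 not found in $libdir or the default search paths"
else
    @info "Using libknet8 at $libknet8_path"
end
```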

Thank you johnh, that was my third option, "go to the src" (by the way, you are right: it is better to use dev than to modify the package in place). However, this means I lose automatic upgrades when a new release appears.

An Overrides.toml file seemed better suited to a local installation, but I do not understand what syntax to use, even after reading and re-reading the Pkg docs (maybe I should read them again?). I was unable to find similar working examples by googling, so I asked for help.

Thanks anyway, will keep you updated if no new suggestion appears.

Maybe the "JLL packages" page of the BinaryBuilder.jl documentation helps. It is specific to JLL packages, but it might help you understand the syntax. In your case, I think the name of the artifact should be libknet8.
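The Pkg documentation describes an Overrides.toml in the artifacts directory of a depot (for example ~/.julia/artifacts/Overrides.toml), where a section keyed by a package UUID maps that package's artifact names to local directories. A minimal sketch that generates such a file for Knet, assuming the override points at the directory containing libknet8.so; the helper name `knet_overrides_toml` is hypothetical, and the UUID is the one shown in the `pkg> st` output above.

```julia
using TOML  # standard library since Julia 1.6; on Julia 1.5 use `Pkg.TOML`

# Knet's package UUID, from the `pkg> st` output above
const KNET_UUID = "1902f260-5fb4-5aff-8c31-6271790ab950"

"""
    knet_overrides_toml(libdir)

Return the text of an Overrides.toml binding Knet's `libknet8`
artifact to `libdir` (the directory that contains libknet8.so).
"""
function knet_overrides_toml(libdir::AbstractString)
    overrides = Dict(KNET_UUID => Dict("libknet8" => abspath(libdir)))
    return sprint(io -> TOML.print(io, overrides))
end
```

The text would then be written to `joinpath(first(DEPOT_PATH), "artifacts", "Overrides.toml")`. This is untested on ppc64le, and it may not be enough on its own: the `Cannot locate artifact` error above is raised when Artifacts.toml has no entry for the current platform at all, which a name-based override might not get past.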

Thank you @giordano for the pointer, I will take a look and possibly test it as soon as I am back at my PC keyboard!

Please feel free to check out pkg"dev Knet" and add a new entry to the Artifacts.toml file. (I am not quite sure what keywords are used for powerpc). You seem to have successfully compiled libknet8.so, all we need to do is to put it to some downloadable location (I use github for the others) and indicate this location in Artifacts.toml and things should work. Once you confirm things work and we finalize the location I will push a new Knet version.


Dear Prof @denizyuret, many thanks for the kind suggestion. I will try to proceed as you suggested. By the way, I have already succeeded in using Knet by changing const libknet8=... to point to the local location, but your suggestion would be much better.

As a side note, I compared the speed (on LeNet) with and without the GPU (using a somewhat hacky way of calling Julia: env _=bin/rr julia -a --project=. tknet.jl). The speed difference is really impressive: 212 s on the CPU vs 4.5 s on the GPU!


News from the front!

tl;dr: while pointing the dev'ed version directly at the library works, the artifact version builds and downloads the artifact but fails at run time with:

julia> @time include("tknet.jl")
ERROR: LoadError: could not load symbol "mul_32_01":
/m100/home/userexternal/jdavid00/.julia/artifacts/18b3c71252f99ebae868dbea3a930e3cc2d89636/libknet8.so: undefined symbol: mul_32_01
Stacktrace:
 [1] broadcasted at /m100/home/userexternal/jdavid00/gits/Knet.jl/src/knetarrays/binary.jl:36 [inlined]
 [2] back(::typeof(sum), ::Type{AutoGrad.Arg{1}}, ::Float32, ::AutoGrad.Result{Float32}, ::AutoGrad.Result{KnetArray{Float32,1}}; d::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./none:0
...

Detailed account:
I have forked Knet at https://github.com/jdadavid/Knet.jl with two modifications:

  • in Artifacts.toml, add the following lines at the beginning:
[[libknet8]]
arch = "powerpc64le"
git-tree-sha1 = "18b3c71252f99ebae868dbea3a930e3cc2d89636"
os = "linux"

   [[libknet8.download]]
   sha256 = "2c3236f8c83aa00e734ea45a007490b3360bb8da0f419fea4efd724551c43f80"
   url = "https://github.com/jdadavid/Knet.jl/releases/download/v1.4.3/libknet8.ppc64le-linux-gnu.tar.gz"

  • in src/libknet8/build.jl, change the 4th line to:
CFLAGS = Sys.iswindows() ? ["/Ox","/LD"] : ["-O3","-Wall","-fPIC","-std=c++11"]

Third, I have uploaded the libknet8 tar.gz referenced in Artifacts.toml (see above) to a new release, v1.4.3.
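The two hashes in an Artifacts.toml entry must describe the exact tarball that is uploaded: git-tree-sha1 is the content hash of the unpacked directory, sha256 is the hash of the .tar.gz file. One way to produce both is with Pkg's own artifact tools; this is a sketch, not necessarily how the tarball above was made, and the helper name `package_libknet8` is hypothetical. Pkg's `bind_artifact!` can also write the entry itself, but it is left out here because its platform argument is constructed differently in Julia 1.5 and in later versions.

```julia
using Pkg.Artifacts  # create_artifact, archive_artifact
using SHA            # sha256 (standard library)

"""
    package_libknet8(sofile, tarball) -> (git_tree_sha1, sha256)

Copy `sofile` into a fresh artifact, pack it into `tarball`, and
return the two hex hashes an Artifacts.toml entry needs.
"""
function package_libknet8(sofile::AbstractString, tarball::AbstractString)
    # create_artifact returns the git-tree-sha1 of the directory contents
    tree_hash = create_artifact() do dir
        cp(sofile, joinpath(dir, "libknet8.so"))
    end
    # Pack the artifact into the tarball to upload as a release asset
    archive_artifact(tree_hash, String(tarball))
    tarball_sha256 = bytes2hex(open(sha256, tarball))
    return bytes2hex(tree_hash.bytes), tarball_sha256
end
```

Regenerating both values from the file that is actually uploaded keeps the Artifacts.toml entry consistent with the release asset.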

Then I can instantiate the cloned Knet (in ~/gits/Knet.jl), precompile it, and load it with using Knet. But it fails at run time (for the record, my previous "dev Knet" setup with const libknet8=path-to-libknet8.so works fine):

[jdavid00@login03 tknetjd]$ julia -q --project=.
(tknetjd) pkg> st
Status `~/jl/tknetjd/Project.toml`
  [c8e1da08] IterTools v1.3.0
→ [1902f260] Knet v1.4.2 `~/gits/Knet.jl`
  [eb30cadb] MLDatasets v0.5.2
Info packages marked with → not downloaded, use `instantiate` to download

(tknetjd) pkg> instantiate
Downloading artifact: libknet8
curl: (22) The requested URL returned error: 404 Not Found
Downloading artifact: libknet8
######################################################################## 100.0%
julia> @time using Knet
 41.710746 seconds (25.88 M allocations: 1.441 GiB, 1.43% gc time)

julia> @time include("tknet.jl")
ERROR: LoadError: could not load symbol "mul_32_01":
/m100/home/userexternal/jdavid00/.julia/artifacts/18b3c71252f99ebae868dbea3a930e3cc2d89636/libknet8.so: undefined symbol: mul_32_01
Stacktrace:
 [1] broadcasted at /m100/home/userexternal/jdavid00/gits/Knet.jl/src/knetarrays/binary.jl:36 [inlined]
 [2] back(::typeof(sum), ::Type{AutoGrad.Arg{1}}, ::Float32, ::AutoGrad.Result{Float32}, ::AutoGrad.Result{KnetArray{Float32,1}}; d::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./none:0
 [3] back(::typeof(sum), ::Type{AutoGrad.Arg{1}}, ::Float32, ::AutoGrad.Result{Float32}, ::AutoGrad.Result{KnetArray{Float32,1}}) at ./none:0
 [4] differentiate(::Function; o::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /m100/home/userexternal/jdavid00/.julia/packages/AutoGrad/VFrAv/src/core.jl:165
 [5] differentiate at /m100/home/userexternal/jdavid00/.julia/packages/AutoGrad/VFrAv/src/core.jl:135 [inlined]
 [6] iterate at /m100/home/userexternal/jdavid00/gits/Knet.jl/src/train20/train.jl:26 [inlined]
 [7] iterate at /m100/home/userexternal/jdavid00/gits/Knet.jl/src/train20/progress.jl:73 [inlined]
 [8] progress!(::Knet.Train20.Minimize{IterTools.NCycle{Knet.Train20.Data{Tuple{KnetArray{Float32,N} where N,Array{Int64,N} where N}}}}; o::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /m100/home/userexternal/jdavid00/gits/Knet.jl/src/train20/progress.jl:60
 [9] progress!(::Knet.Train20.Minimize{IterTools.NCycle{Knet.Train20.Data{Tuple{KnetArray{Float32,N} where N,Array{Int64,N} where N}}}}) at /m100/home/userexternal/jdavid00/gits/Knet.jl/src/train20/progress.jl:60
 [10] top-level scope at /m100/home/userexternal/jdavid00/jl/tknetjd/tknet.jl:41
 [11] include(::String) at ./client.jl:457
 [12] top-level scope at ./timing.jl:174 [inlined]
 [13] top-level scope at ./REPL[4]:0
in expression starting at /m100/home/userexternal/jdavid00/jl/tknetjd/tknet.jl:41

julia>
[jdavid00@login03 tknetjd]$

I will try to understand the problem (maybe using debug_artifact?). For the time being, here are my first ideas:

Regarding the first curl error, maybe it is because this Knet version is not registered, so there is no cached copy on the Julia package servers? Or is there another problem? Nevertheless, Julia seems able to download the artifact on the second try, and I can verify that the libknet8.so is exactly the same as the one used by my working version (with const libknet8=path-to-local-so).

So where does this LoadError come from? Does anybody have an idea?

I can confirm the download works (and I also get the first curl error; you are probably right about the package servers). I could not test the rest without access to a powerpc machine. If I had one, the next few things I'd try are: after using Knet, confirm that the correct libknet8 is used by looking at the path in Knet.libknet8, and make sure the file it points to is identical to the one that works, using cmp file1 file2.
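The two checks suggested above, plus a direct test for the symbol named in the error (`mul_32_01`), can be sketched from inside Julia. The helper names `same_file` and `has_symbol` are hypothetical; `has_symbol` uses `dlopen_e`/`dlsym_e`, which return `C_NULL` instead of throwing.

```julia
using Libdl

# Byte-for-byte comparison, like `cmp file1 file2`
same_file(a::AbstractString, b::AbstractString) = read(a) == read(b)

# True if the library at `path` can be dlopen'ed and exports `sym`
function has_symbol(path::AbstractString, sym::AbstractString)
    handle = Libdl.dlopen_e(path)  # C_NULL instead of an exception
    handle == C_NULL && return false
    try
        return Libdl.dlsym_e(handle, sym) != C_NULL
    finally
        Libdl.dlclose(handle)
    end
end
```

Since `find_library` may return a name without the .so extension, the full file name can be obtained first, e.g. `p = Libdl.dlpath(Knet.libknet8)`, and then checked with `same_file(p, "/path/to/working/libknet8.so")` and `has_symbol(p, "mul_32_01")` (the second path is a placeholder).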

Dear Prof @denizyuret, I have found and fixed the issue (I had uploaded the wrong version of the library).
Current clone https://github.com/jdadavid/Knet.jl is working for me on my (m100) platform.

So if you want to enable ppc64le in the registered Knet version, you can:

  1. (re-)download the library from
https://github.com/jdadavid/Knet.jl/releases/download/v1.4.3/libknet8.ppc64le-linux-gnu.tar.gz

and then add it to the assets of the corresponding release in your repository.

  2. add the following lines to Artifacts.toml:
[[libknet8]]
arch = "powerpc64le"
git-tree-sha1 = "40478d85ecfa6b64bc54ca18e75e477cf15ddf3a"
os = "linux"

   [[libknet8.download]]
   sha256 = "a7f7e8783aaebae2d6379cfca340999de70bb202d776a05fba661d1c8626e7cc"
   url = "https://github.com/jdadavid/Knet.jl/releases/download/v1.4.3/libknet8.ppc64le-linux-gnu.tar.gz"

That should be sufficient to make Knet work on any ppc64le platform.
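The entry above is only used on machines whose architecture and OS match its `arch` and `os` keys; without such an entry, `artifact"libknet8"` fails with the `Cannot locate artifact` error from the first post. A simplified sketch of that check, which only compares `arch` and `os` (Pkg's real platform matching considers more tags); the function name `matching_entries` is hypothetical.

```julia
using TOML  # standard library since Julia 1.6; on Julia 1.5 use `Pkg.TOML`

"""
    matching_entries(toml_text, name; arch, os)

Return the `[[name]]` entries of an Artifacts.toml whose `arch` and
`os` keys equal the given values (defaults: this machine's arch, Linux).
"""
function matching_entries(toml_text::AbstractString, name::AbstractString;
                          arch::AbstractString = string(Sys.ARCH),
                          os::AbstractString = "linux")
    entries = get(TOML.parse(toml_text), name, Any[])
    # A platform-independent artifact is a single table, not an array
    entries isa AbstractVector || (entries = Any[entries])
    return [e for e in entries
            if get(e, "arch", nothing) == arch && get(e, "os", nothing) == os]
end
```

On the ppc64le nodes, `Sys.ARCH` is `:powerpc64le`, so the default arguments would select the new entry there.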

  3. Optionally, change line 4 of src/libknet8/build.jl to:

CFLAGS = Sys.iswindows() ? ["/Ox","/LD"] : ["-O3","-Wall","-fPIC","-std=c++11"]

The -std=c++11 option is needed for the build on ppc64le/Red Hat/gcc (but before adding it, you may want to check that it does not break other platforms).

As an alternative to the last two steps, I could submit a PR if you prefer. In any case, I now have a working version, so many thanks for your support!

Dear @jdad: I implemented your changes in the dy/powerpc branch and opened a pull request: https://github.com/denizyuret/Knet.jl/pull/625

I confirmed that -std=c++11 option did not break x86_64 and copied your precompiled tar.gz to the main Knet repo. Please confirm that this branch works for you and I will merge it to master.

Thanks!


Dear Prof. @denizyuret, I can confirm that it works.

[jdavid00@r232n20 tknetdz]$ julia -q --project=.
(tknetdz) pkg> st
Status `~/jl/tknetdz/Project.toml`
  [c8e1da08] IterTools v1.3.0
  [1902f260] Knet v1.4.3 `/m100/home/userexternal/jdavid00/dz/Knet.jl#dy/powerpc`
  [eb30cadb] MLDatasets v0.5.2

julia> @time include("tknet.jl")
┣████████████████████┫ [100.00%, 1800/1800, 00:33/00:33, 55.32i/s]
 63.820460 seconds (99.83 M allocations: 6.113 GiB, 2.52% gc time)
0.9882

julia> @time include("tknet.jl")
┣████████████████████┫ [100.00%, 1800/1800, 00:06/00:06, 325.75i/s]
  5.707285 seconds (7.05 M allocations: 1.076 GiB, 2.32% gc time)
0.9914

julia> @time include("tknet.jl")
┣████████████████████┫ [100.00%, 1800/1800, 00:05/00:05, 398.82i/s]
  4.859296 seconds (7.10 M allocations: 1.077 GiB, 2.57% gc time)
0.9893

julia> @time include("tknet.jl")
┣████████████████████┫ [100.00%, 1800/1800, 00:05/00:05, 395.86i/s]
  4.860489 seconds (7.06 M allocations: 1.078 GiB, 5.12% gc time)
0.9882

julia>

Thank you very much !

PS: for the record, the CPU version:

[jdavid00@r232n20 tknetdz]$ env _=bin/rr julia -q --project=.
julia> @time include("tknet.jl")
┣████████████████████┫ [100.00%, 1800/1800, 03:43/03:43, 8.07i/s]
258.580989 seconds (87.91 M allocations: 63.434 GiB, 0.81% gc time)
0.9887

julia> @time include("tknet.jl")
┣████████████████████┫ [100.00%, 1800/1800, 03:46/03:46, 7.95i/s]
230.535260 seconds (4.06 M allocations: 58.869 GiB, 0.37% gc time)
0.9887
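For a rough number behind "really impressive", the @time results above can be compared directly. This is end-to-end wall time for the whole tknet.jl script, not a kernel-level benchmark; the first GPU run (63.8 s) is left out because it includes compilation. The ratio comes out at about 47, in line with the earlier 212 s vs 4.5 s figure.

```julia
# Wall-clock times in seconds, copied from the @time lines above.
# The first GPU run (63.8 s) is excluded because it includes compilation.
gpu_times = [5.707285, 4.859296, 4.860489]
cpu_times = [258.580989, 230.535260]

# Ratio of the fastest CPU run to the fastest GPU run
speedup = minimum(cpu_times) / minimum(gpu_times)
println("GPU speedup on this script: ", round(speedup; digits = 1), "x")
```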

and the versioninfo output:

[jdavid00@r232n20 tknetdz]$ julia -q
julia> using CUDA

julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943* (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (powerpc64le-unknown-linux-gnu)
  CPU: unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, pwr9)
Environment:
  JULIA_HDF5_LIBRARY_PATH = /cineca/prod/opt/libraries/hdf5/1.10.6/gnu--8.4.0/lib
  JULIA_EDITOR = vim

julia> CUDA.versioninfo()
CUDA toolkit 10.2.89, local installation
CUDA driver 10.2.0
NVIDIA driver 440.64.0

Libraries:
- CUBLAS: 10.2.2
- CURAND: 10.1.2
- CUFFT: 10.1.2
- CUSOLVER: 10.3.0
- CUSPARSE: 10.3.1
- CUPTI: 12.0.0
- NVML: 10.0.0+440.64.0
- CUDNN: 8.0.4 (for CUDA 10.2.0)
- CUTENSOR: missing

Toolchain:
- Julia: 1.5.2
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

4 devices:
  0: Tesla V100-SXM2-16GB (sm_70, 15.396 GiB / 15.782 GiB available)
  1: Tesla V100-SXM2-16GB (sm_70, 15.762 GiB / 15.782 GiB available)
  2: Tesla V100-SXM2-16GB (sm_70, 15.762 GiB / 15.782 GiB available)
  3: Tesla V100-SXM2-16GB (sm_70, 15.762 GiB / 15.782 GiB available)