Thank you very much, this is great feedback!
Your setup was very similar to mine, except:
- I have 8GB RAM in a single bank; you have 16GB, probably across two banks
- my CPU is "half" of yours: 2 cores, 4 threads
Despite this, the results are very similar; maybe yours are slightly better. I don't understand why this "60K" seems so hard to beat. What's so special about it? It's so difficult to work out what would improve the conversion speed: do I buy more RAM? Faster RAM? A faster CPU? More cores? A CPU with more cache? More memory channels? I still can't find an answer. Why would more cores help if tdb2.tdbloader runs in a single thread? Maybe the reason is that with more cores, your Xeon can handle more RAM concurrently? I don't understand...
With your Xeon, you said you were able to get to 120K, right? Which Xeon, motherboard, and RAM did you use?
If anybody has a Xeon or Opteron, it would be nice if they could offer more feedback too, even with slower RAM such as DDR3-1333. I certainly can't wait to read your feedback with the Threadripper :)
Keep us posted!
Sent: Friday, December 01, 2017 at 9:11 PM
From: "Dick Murray" <***@gmail.com>
To: ***@jena.apache.org
Subject: Re: tdb2.tdbloader performance
Hi.
Sorry for the delay :-)
Short story: I used the following "reasonable" device
Dell M3800
Fedora 27
16GB SODIMM DDR3 Synchronous 1600 MHz
CPU cache L1/256KB,L2/1MB,L3/6MB
Intel(R) Core(TM) i7-4702HQ CPU @ 2.20GHz 4 cores 8 threads
to load part of latest-truthy.nt from a USB 3.0 1TB drive onto a 6GB RAM
disk, with these results at different cpulimit settings:
@800% 60K/Sec
@100% 40K/Sec
@50% 20K/Sec
The full source file contains 2.2 billion triples in a 10GB bz2, which
decompresses to about 278GB of N-Triples (277,563,574,685 bytes). I split
it into chunks of 10M triples and used the first one to test.
Check with Andy but I think it's limited by CPU, which is why my 24 core (4
x Xeon 6 Core @2.5GHz) 128GB server is able to run concurrent loads with no
performance hit.
I might have access to an AMD ThreadRipper 12 core 24 thread 5GHz in the
next few days and I will try and test against it.
I haven't run the full import because a) I'm guessing the resulting TDB2
will be "large" and b) my servers are currently importing other "large"
TDB2s!
Long story follows...
decompress the file;
pbzip2 -dv -p4 -m1024 latest-truthy.nt.bz2
Parallel BZIP2 v1.1.12 [Dec 21, 2014]
By: Jeff Gilchrist [http://compression.ca]
Major contributions: Yavor Nikolov [http://javornikolov.wordpress.com]
Uses libbzip2 by Julian Seward
# CPUs: 4
Maximum Memory: 1024 MB
Ignore Trailing Garbage: off
-------------------------------------------
File #: 1 of 1
Input Name: latest-truthy.nt.bz2
Output Name: latest-truthy.nt
BWT Block Size: 900k
Input Size: 9965955258 bytes
Decompressing data...
Output Size: 277563574685 bytes
-------------------------------------------
Wall Clock: 5871.550948 seconds
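A quick sanity check on the pbzip2 figures above, as a minimal Python sketch. The numbers are copied from the output; the variable names are only illustrative.

```python
# Figures copied from the pbzip2 run above.
input_bytes = 9_965_955_258       # latest-truthy.nt.bz2
output_bytes = 277_563_574_685    # latest-truthy.nt
wall_clock_s = 5871.550948

ratio = output_bytes / input_bytes
out_mb_per_s = output_bytes / wall_clock_s / 1e6   # decimal megabytes per second
minutes = wall_clock_s / 60

print(f"compression ratio : {ratio:.1f}x")
print(f"output size       : {output_bytes / 1e9:.1f} GB ({output_bytes / 2**30:.1f} GiB)")
print(f"output throughput : {out_mb_per_s:.1f} MB/s over {minutes:.0f} minutes")
```

So the decompressed file is about 278GB (about 258GiB), written at roughly 47MB/s with -p4.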
count the lines;
wc -l latest-truthy.nt
2199382887 latest-truthy.nt
Just short of 2200M...
split the file into 10M chunks;
split -d -l 10485760 -a 3 --verbose latest-truthy.nt latest-truthy.nt.
creating file 'latest-truthy.nt.000'
creating file 'latest-truthy.nt.001'
creating file 'latest-truthy.nt.002'
creating file 'latest-truthy.nt.003'
creating file 'latest-truthy.nt.004'
creating file 'latest-truthy.nt.005'
...
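For reference, the number of chunks that split should produce follows directly from the line count and the -l value. A minimal Python sketch, using only the figures above:

```python
import math

total_lines = 2_199_382_887   # from wc -l
chunk_lines = 10_485_760      # split -l value, i.e. 10 * 1024 * 1024

chunks = math.ceil(total_lines / chunk_lines)
last_chunk_lines = total_lines - (chunks - 1) * chunk_lines

print(f"{chunks} chunks, suffixes 000 to {chunks - 1:03d}")
print(f"last chunk holds {last_chunk_lines:,} lines")
```

That gives 210 chunks, so the three-digit suffix from -a 3 is enough, and only the last chunk is short (7,859,047 lines).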
Restart!
sudo cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v \
    --loc /media/ramdisk/ latest-truthy.000.nt
ps aux | grep tdb2
root 3358 0.0 0.0 222844 5756 pts/0 S+ 19:22 0:00 sudo
cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 3359 0.0 0.0 4500 776 pts/0 S+ 19:22 0:00 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 3360 0.0 0.0 120304 3288 pts/0 S+ 19:22 0:00 sh
./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
latest-truthy.000.nt
root 3361 4.9 0.0 4500 92 pts/0 S<+ 19:22 0:05 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 3366 95.7 14.8 7866116 2418768 pts/0 Sl+ 19:22 1:42 java
-Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
-cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt
dick 3477 0.0 0.0 119728 972 pts/1 S+ 19:24 0:00 grep
--color=auto tdb2
Notice that PID 3366 (the java process) is running with the default -Xmx2G.
19:26:49 INFO TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 247.28s (Avg: 42,404)
After the first pass there is no read from the 1TB source as the OS has
cached the 1.2G source.
19:33:50 INFO TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 245.70s (Avg: 42,677)
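The "1.2G" chunk size can be cross-checked from the totals reported earlier. This is only a rough Python estimate and assumes the average line length is about the same across the whole file, which may not hold for the first chunk:

```python
total_bytes = 277_563_574_685   # pbzip2 "Output Size"
total_lines = 2_199_382_887     # wc -l
chunk_lines = 10_485_760        # split -l

avg_line_bytes = total_bytes / total_lines
chunk_bytes = avg_line_bytes * chunk_lines
chunk_gib = chunk_bytes / 2**30

print(f"average line length : {avg_line_bytes:.1f} bytes")
print(f"estimated chunk size: {chunk_gib:.2f} GiB")
```

At roughly 126 bytes per triple, a full chunk comes to a little over 1.2GiB, which is consistent with the figure above and small enough for the OS to keep cached.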
export JVM_ARGS="-Xmx4G" i.e. increase the max heap and help the GC
sudo ps aux | grep tdb2
root 4317 0.0 0.0 222848 6236 pts/0 S+ 19:35 0:00 sudo
cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4321 0.0 0.0 4500 924 pts/0 S+ 19:35 0:00 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4322 0.0 0.0 120304 3356 pts/0 S+ 19:35 0:00 sh
./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
latest-truthy.000.nt
root 4323 4.8 0.0 4500 88 pts/0 S<+ 19:35 0:09 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4328 94.8 18.5 8406788 3036188 pts/0 Sl+ 19:35 3:01 java
-Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
-cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt
dick 4594 0.0 0.0 119728 1024 pts/1 S+ 19:38 0:00 grep
--color=auto tdb2
At 800K triples the java process was at 3GB, and it peaked at 3.4GB just prior to completion.
19:39:23 INFO TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 247.65s (Avg: 42,340)
Throw all CPU resources at it, i.e. -l 800 (8 threads at 100% each)
sudo cpulimit -v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v \
    --loc /media/ramdisk/ latest-truthy.000.nt
The average was above 45K/sec by 350K triples and above 60K/sec by 1.2M triples.
19:43:38 INFO TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 166.91s (Avg: 62,823)
sudo ps aux | grep tdb2
root 4740 0.0 0.0 222848 6264 pts/0 S+ 19:40 0:00 sudo
cpulimit -v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4744 0.0 0.0 4500 720 pts/0 S+ 19:40 0:00 cpulimit
-v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4745 0.0 0.0 120304 3208 pts/0 S+ 19:40 0:00 sh
./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
latest-truthy.000.nt
root 4746 4.7 0.0 4500 92 pts/0 R<+ 19:40 0:07 cpulimit
-v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4751 131 21.1 8693508 3448252 pts/0 Sl+ 19:40 3:32 java
-Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
-cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt
dick 4808 0.0 0.0 119728 1060 pts/1 S+ 19:43 0:00 grep
--color=auto tdb2
Heap peaked at 3.4GB.
Limit it to half a core, i.e. -l 50
sudo cpulimit -v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v \
    --loc /media/ramdisk/ latest-truthy.000.nt
sudo ps aux | grep tdb2
root 4898 0.0 0.0 222844 5672 pts/0 S+ 19:45 0:00 sudo
cpulimit -v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4899 0.0 0.0 4500 724 pts/0 S+ 19:45 0:00 cpulimit
-v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4900 0.0 0.0 120304 3244 pts/0 T+ 19:45 0:00 sh
./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
latest-truthy.000.nt
root 4901 5.5 0.0 4500 92 pts/0 S<+ 19:45 0:25 cpulimit
-v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4906 50.5 20.7 8685316 3395236 pts/0 Tl+ 19:45 3:55 java
-Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
-cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt
dick 4983 0.0 0.0 119728 1072 pts/1 S+ 19:53 0:00 grep
--color=auto tdb2
19:53:38 INFO TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 482.27s (Avg: 21,742)
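To put the five runs side by side, here is a minimal Python sketch that recomputes the rates from the elapsed times in the "Finished" lines and extrapolates to the full file. The labels and the choice of baseline are mine, and the full-load projection assumes the rate stays flat for all 2.2 billion triples, which is probably optimistic.

```python
CHUNK_TRIPLES = 10_485_760      # triples in the test chunk
TOTAL_TRIPLES = 2_199_382_887   # wc -l on the full file

# (label, elapsed seconds) copied from the "Finished" log lines above.
runs = [
    ("cpulimit 100, first pass", 247.28),
    ("cpulimit 100, source cached", 245.70),
    ("cpulimit 100, -Xmx4G", 247.65),
    ("cpulimit 800", 166.91),
    ("cpulimit 50", 482.27),
]

baseline_s = 247.65  # the cpulimit 100 run with -Xmx4G


def summarize(label, seconds):
    """Return (label, triples/s, speedup vs baseline, projected full-load hours)."""
    rate = CHUNK_TRIPLES / seconds
    speedup = baseline_s / seconds
    full_load_hours = TOTAL_TRIPLES / rate / 3600
    return label, rate, speedup, full_load_hours


results = [summarize(label, seconds) for label, seconds in runs]
for label, rate, speedup, hours in results:
    print(f"{label:30s} {rate:9,.0f} triples/s  {speedup:4.2f}x  ~{hours:5.1f} h")
```

The recomputed rates should match the logged averages to within rounding. Going from 100 to 800 buys only about 1.5x, and ps showed the java process at around 131% CPU in that run, which fits the idea that most of the load is serial.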