Monday, December 28, 2015

Prime tables and the prime counting function

The only possibility for a very fast prime counting function is to make tables. To make a straight forward table would take a lot of time and a lot of memory, but a table of primes is more compact, is loading faster and can be used for fast prime counting.

It's possible to store consecutive prime series in tables using only two bytes per prime number up to virtually any size (at least 100 decimals or so), with a technique using that the prime gaps for those numbers are less than 65536=2^16. What's stored in the main table are the prime numbers modulo 65536 - and in a minor table "break points" are stored: the number n when the next prime Pn+1 divided with 65536 is greater than the same for the previous prime. A fast search in the minor table gives the index number, that multiplied with 65536 should be added to the number in the main table to get the prime.

Such a table for all primes less than 2^24 is created in less than a second and can be used to compute π(x) for x<16777216 in a few microseconds. The memory needed for the main table is 2*π(16777216)=2*1077871, slightly more than 2 Mb, and for the minor table 1-2 kb if the break points are stored as single integers. The memory saved is near 50 % in this case, but for larger tables the memory saved will be 100 % or more.

Making a table for all 64 bit primes would need about 5 billion Gb. All 32 bit primes would need about 1 Gb. All primes less than 100,000,000 would need about 10 Mb and would take some time to load (about 5 seconds). For greater values it's more realistic to use other and slower methods, like files with records of primes and their order numbers, with one second computing time gap in between.

First some words that is used:

: prevprime ( numb -- prime )
  dup 3 u< if drop 2 exit then
  1- 1 or 
  begin dup fermat-rabin-miller 0=
  while 2 -
  repeat ;

: ?def ( -- flag )  bl word find nip 0= ;

?def under+ [if]

: under+ ( a b c -- a+c b )  rot + swap ; [then]

: 8/mod ( n -- r q )  dup 7 and swap 3 rshift ;

: 256/mod ( n -- r q )  dup 255 and swap 8 rshift ;
: 65536/mod ( n -- r q )  dup 65535 and swap 16 rshift ;

The idea is to use the sieve of Eratosthenes to create av very fast word prime_ ( n -- flag ) and bit arrays is used.

\ bit arrays

: adbit ( i ad -- j a )  swap 8/mod rot + ;

: setbit \ i ad --

  adbit 1 rot lshift over c@ or swap c! ;

: tglbit \ i ad --

  adbit 1 rot lshift over c@ xor swap c! ;

: clrbit \ i ad --

  adbit 1 rot lshift 255 xor over c@ and swap c! ;

: bit@ \ i ad -- bit

  adbit c@ swap rshift 1 and ;

: bit! \ bit i ad --

  rot if setbit else clrbit then ;

: bit? ( i ad -- f )  bit@ negate ;

\ Defining named bitarrays i.e.
\ 1000000 bitarray megaset constant megasetbuffer
: bitarray \ bits -- ad | n -- bit
  8/mod swap 0= 0= negate + allocate throw
  create dup ,
  does> @ swap 8/mod rot + bit@ ;

For storing numbers in i=2,3,... bytes spaces I use the words 

mb! ( n ad i -- ) 


mb@ ( ad i -- n ) 

\ multiple byte read/write in arrays

variable ebuf
1 ebuf ! ebuf c@ negate
[if] \ little-endian, as in Windows and Android
: mb! ( n ad i -- )  2>r ebuf ! ebuf 2r> cmove ;
: mb@ ( ad i -- n )  ebuf 0 over ! swap cmove ebuf @ ;
[else] \ big-endian, as in MacOS
: mb! ( n ad i -- )  2>r ebuf ! ebuf cell + r@ - 2r> cmove ;
: mb@ ( ad i -- n )  ebuf 0 over ! cell + over - swap cmove ebuf @ ;

The tables

The method below can be used for any number of primes. Here plim=2^24 and pi_plim=π(plim)=1077871:

\ the sieve of Eratosthenes 

0xfffff constant plim
82025 constant pi_plim

\ 16777215 constant plim   \ Loads in 
\ 1077871 constant pi_plim \ a few seconds

\ 100000000 constant plim  \ 100000000 takes 6 times 
\ 5761455 constant pi_plim \ longer time to load

plim bitarray prime_ constant pbuf
\ prime_ ( n -- flag )  n<plim

: resetsieve ( -- )
  pbuf plim 8/mod swap 0= 0= - + pbuf 
  do true i ! cell +loop ;

: sieve ( -- )

  0 pbuf clrbit 
  1 pbuf clrbit
  plim sqrtf 1+ 2
  do i pbuf bit@
     if plim 1+ i dup *
        do i pbuf clrbit j +loop
  loop ; \ takes approximate 3 seconds or less to load

Next improvement of the prime functions:

plim prevprime constant pmax_

: nextprime_ ( numb -- prime )  \ numb<pmax_
  dup 3 u< if drop 2 exit then
  1 or
  begin dup prime_ 0=
  while 2 +
  repeat ;

: prevprime_ ( numb -- prime )
  dup 3 u< if drop 2 exit then
  1- 1 or 
  begin dup prime_ 0=
  while 2 -
  repeat ;

: prime ( n -- flag )

  dup plim u> if prime else prime_ negate then ;

: nextprime ( numb -- prime )

  dup pmax_ u< if nextprime_ else nextprime then ;

: prevprime ( numb -- prime )

  dup plim u< if prevprime_ else prevprime then ;

It takes less than a second to store all primes less than plim=2^24 in the buffer pnrbuf.

plim 65536/mod swap 0= 0= - constant breaknumbers
pi_plim 2* allocate throw constant pnrbuf 
breaknumbers cells allocate throw constant breaks 

: init_pnr ( -- )
  0 pad ! 0 breaks ! 
  1 pi_plim 1+ 1 
  do nextprime_ dup                   \ p p
     65536/mod pad @ = 0=             \ p r flag
     if 1 pad +!                      \ p r
        i 1- pad @ cells breaks + !   \ p r
     then pnrbuf i 1- 2* + 2 mb! 1+   \ p+1
  loop drop ; init_pnr 

: newintbreaks ( i j x -- i' j' )     \ bisection search
  >r 2dup + 2/ dup
  cells breaks + @ r> u>              \ i j k flag 
  if -rot nip else nip then ; 

: pnr@ ( n -- p )                     \ p is n:th prime 

  1- dup >r 2* pnrbuf + 2 mb@         \ n ≤ pi_plim
  breaknumbers 0 
  begin r@ newintbreaks 2dup - 2 u< 
  until rdrop nip 16 lshift + ; 

From n comes the modulo in pnrbuf and bisection search in breaks gives the index i. Then Pn=Pn(mod 2^16)+i*2^16, where 0≤i<breaknumbers.

: newintpnr ( i j x -- i' j' ) 
  >r 2dup + 2/ dup pnr@ r> u>         \ i j k flag 
  if -rot nip else nip then ; 

: pi ( x -- n ) >r pi_plim 1+ 0       \ x ≤ plim

  begin r@ newintpnr 2dup - 2 u<      \ i j flag 
  until rdrop nip ;

A conjecture

A multiplicative function is such that f(ab)=f(a)f(b). All multiplicative functions on the positive integers can be defined by the action on the primes in the prime factorization and any function defined that way is multiplicative. The function that maps primes p on π(p), that is, the nth prime on n, defines a multiplicative function f that maps positive integers on positive integers.

Blue plot f(n) and red plot π(n)
The conjecture is that f(n)≤π(n) except for n=35,49 and there is a proof on mathematics stack exchange but it's unclear if the proof is verified. Testing the conjecture for all n≤plim takes a few minuts if plim=100000000:

\ conjecture: if f:p1^n1*...*pk^nk -> 1^n1*...*k^nk ==>
\ f(n)π(n) for n>49

: func_f ( n -- m )
  oddpart nip pollard# sort 1 swap 0
  do swap pi * loop ;

Since f(Pn)=π(Pn)=n it's enough to test non primes:

: conjecture ( N -- )  1
  do i prime_ 0=
     if i func_f i pi u>
        if cr i .
  loop ;

100000000 conjecture
49  ok


  1. "...more than 100 %."

    You can't save more than 100%.

  2. No, you are right! But it could takes more than 100 % more memory NOT using any squeezing method. Happy new year!

  3. While the mechanics of doing table-based prime counts are very interesting, I don't agree with the opening thesis: "The only possibility for a very fast prime counting function is to make tables." There are very fast prime count methods that don't use tables.

    1. I really would like to see that algorithm. The algorithm at Wikipedia is very slow. There is a linked paper with some complicated computations that might make it faster, but I can't imagine it's comparable with table-based counting.

    2. The methods of Legendre, Meissel, and Lehmer all work quite well. LMO faster yet. The first three are well described in Riesel's 1994 book. LMO is rather complicated and I haven't seen an exposition that easily leads to code (combining the papers and spending quite a bit of time does work, but it's nothing like the others).

      See benchmark results and LMO code at:
      All the other algorithms are also in that repo.

      Also see:
      which also has implementations of all the mentioned methods.

      Sparse tables are a good method for small inputs, and my code uses them to 60M. But I'm also making a generic library and nobody wants big tables. The nth prime page can use absurdly large tables since that is the one and only application, and every user doesn't have to store the files. Note the time up to 10 billion in these functions are in the single millisecond or lower range -- I use tables to optimize for the case of being called inside a tight loop. Kim doesn't do anything different for small values.

      Assuming your goal was to stay competitive with those, some good use of indices or estimates, sparse tables, and a good sieve for the difference could probably keep up to 10^18 or so using under 10MB. At some point you'll have the same problem as a multi-petabyte dense table system -- you've run out of data that other people have calculated, and now you have to start sieving to expand the table. How long will it take to fill in the table from 1*10^18 to 2*10^18?

    3. I'll look at it once again. It's the recursive computation of phi that made me suspicious. I can't understand that a recursion formula like


      can be that fast as in your link.

  4. Speeding up the phi function is a critical part of making it fast -- you can't naively use the recursive formula for large values. There are a number of ways to do it, each with different tradeoffs. Mapes' method is one way to handle small values. Another good method for dealing with small a values is to use small tables, which can give you almost instant results for phi(x,a) when a < 10. Caching can also be done to vastly cut down on the number of recursions (basically doing efficient memoization). I made a very different method that creates a sorted list as it walks down the a's -- this has very few function calls but can be memory intensive.

    Sage uses an implementation of Legendre's method, with interesting phi optimizations -- Mapes' type calcs for small values, some shortcuts for large values, and then tables. You can see the code here:

    primecount now uses eLMO by default, but the other algorithms are still there. His Phi uses small tables with small a values, caching otherwise.

    Perl/ntheory uses eLMO by default, but the other algorithms are still there (no longer compiled in by default). There are a few ways to do phi -- small a with small tables (~32k of shorts), caching (32MB cache), or my a-walker.