Discussion:
My PC's cores can synchronize to about 1.000 clock cycles accuracy
Bonita Montero
2024-09-13 16:45:41 UTC
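#include <algorithm>
#include <atomic>
#include <barrier>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>
#if defined(_WIN32)
    #include <intrin.h>
#elif defined(__linux__)
    #include <x86intrin.h>
#endif

using namespace std;

int main()
{
    // hardware_concurrency() may return 0 when unknown; use at least one thread
    unsigned hc = max( thread::hardware_concurrency(), 1u );
    barrier bar( hc );
    atomic_uint synch( hc );    // countdown that releases all threads at once
    atomic_uint64_t zero( 0 );  // TSC of the first thread past the gate
    atomic_int64_t diffs( 0 );  // sum of |TSC distance| over all threads
    auto thr = [&]()
    {
        int64_t sum = 0;
        for( unsigned t = 1'000; t; --t )
        {
            bar.arrive_and_wait();
            // all threads but the last spin until the count reaches zero
            if( synch.fetch_sub( 1, memory_order_relaxed ) > 1 )
                while( synch.load( memory_order_relaxed ) );
            uint64_t tsc = __rdtsc(), expected = 0;
            // the first thread publishes its TSC, the others add their distance
            // to it; the strong CAS avoids a spurious failure that would leave
            // expected == 0 and add a bogus, huge difference
            if( !zero.compare_exchange_strong( expected, tsc,
                    memory_order_relaxed, memory_order_relaxed ) )
                sum += abs( (int64_t)(expected - tsc) );
            bar.arrive_and_wait();
            // reset for the next round; the barrier at the top orders these stores
            synch.store( hc );
            zero.store( 0, memory_order_relaxed );
        }
        diffs.fetch_add( sum, memory_order_relaxed );
    };
    vector<jthread> threads;
    threads.reserve( hc - 1 );
    for( unsigned t = hc - 1; t; --t )
        threads.emplace_back( thr );
    thr();                 // the main thread takes part as well
    threads.resize( 0 );   // jthread destructors join the workers
    // mean distance in TSC ticks per thread and round
    cout << (double)diffs.load( memory_order_relaxed ) / (1'000.0 * hc) << endl;
}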

My PC is an AMD 7950X 16-core system.
Bonita Montero
2024-09-13 16:49:51 UTC
Sorry, wrong newsgroup.
Post by Bonita Montero
#include <iostream>
[...]
Tim Rentsch
2024-09-13 17:16:05 UTC
Post by Bonita Montero
#include <iostream>
#include <barrier>
#include <thread>
#include <vector>
#if defined(_WIN32)
#include <intrin.h>
#elif defined(__linux__)
#include <x86intrin.h>
#endif
using namespace std;
[...]
Wrong newsgroup, shit-for-brains.
Bonita Montero
2024-09-14 03:01:16 UTC
Post by Tim Rentsch
Post by Bonita Montero
[...]
Wrong newsgroup, shit-for-brains.
My point is more the general principle: the interconnect between the
CPU cores of my PC is so fast that the cores can synchronize to within
about 1'000 clock cycles.
Lawrence D'Oliveiro
2024-09-16 06:43:29 UTC
This posting was completely baffling to me, until I realized ...
Bonita Montero
2024-09-16 10:56:36 UTC
Post by Lawrence D'Oliveiro
This posting was completely baffling to me, until I realized ...
I'm from Europe and I can handle both types of decimal points.
Paul
2024-09-16 19:15:09 UTC
Post by Bonita Montero
Post by Lawrence D'Oliveiro
This posting was completely baffling to me, until I realized ...
I'm from Europe and I can handle both types of decimal points.
I'm from "somewhere" and 1000 is 1000 here. Punctuation
is for Excel spreadsheets :-)

*******

The hardware declares this property itself (through a CPUID feature
bit), so in principle you don't even have to measure anything.

"Since the family 10h (Barcelona/Phenom), AMD chips feature a constant TSC,
which can be driven either by the HyperTransport speed or the highest P state.
A CPUID bit (Fn8000_0007:EDX_8) advertises this;
Intel CPUs also report their invariant TSC on that bit."

On Linux, "constant_tsc" appears in the CPU feature list.
Both of my machines have it (a ten-year-old Intel machine
and a two-year-old AMD machine). Your machine would list it
in the "flags" line of /proc/cpuinfo.

So you didn't need to write that code for your 7950X; you
could just check the CPU feature bit for the "constant_tsc"
property, AKA invariant TSC. Your CPU is not "synchronized";
the hardware just does not vary across the face of the CPU.
In a sense, it's an entirely different feature.

The way Wikipedia puts it:

"The specific processor configuration determines the behavior.
Constant TSC behavior ensures that the duration of each clock tick
is uniform and makes it possible to use the TSC as a wall-clock timer
even if the processor core changes frequency. This is the
architectural behavior for all later Intel processors."

Paul
