Growing model complexity and data volumes in Machine Learning (ML), and Deep Learning (DL) in particular, necessitate parallel processing for efficient, scalable computation. Benchmarking is critical for evaluating parallel ML techniques, yet current methodologies predominantly emphasize quantitative metrics such as throughput and accuracy, as exemplified by MLPerf and HPC-AI500. This quantitative focus neglects key qualitative factors (portability, robustness, deployment complexity, and usefulness) that are essential for practical parallel ML. This paper argues for a paradigm shift in benchmarking, advocating a holistic evaluation framework that integrates qualitative assessments alongside traditional quantitative measures. To this end, we introduce NNoPP (Neural Network on-top-of Parallel Processing), a structured evaluation approach designed to complement existing benchmarks. NNoPP proposes seven criteria (novelty, portability, performance, scalability, complexity, usefulness, and sustainability), each methodically defined to capture the real-world viability, sustainability, and impact of a parallel ML solution. Beyond raw performance numbers, these metrics promote solutions that are computationally efficient, practically deployable, broadly applicable, and robust across diverse contexts. Through illustrative case studies across key DL model families, we demonstrate the application of NNoPP and its capacity to provide nuanced insights beyond conventional benchmark rankings. Future directions for parallel ML benchmarking emphasize integrated metrics, dynamic workloads, domain-specific relevance, and collaborative community evolution. Ultimately, this paper calls for a re-envisioned, holistic benchmarking approach that fosters the adoption of parallel ML solutions that are not only quantitatively superior but also qualitatively robust, useful, and sustainably impactful.
This paper considers distributed algorithms in which asynchronous processes cooperate through a shared memory, in an adversarial context where both the computing entities (namely, sequential processes) are anonymous and the communication medium is made up of shared registers that have no global names, in the following sense: a shared register known as A to a process p may be known as Z to another process q, without either process being aware of this. The crucial question is then: what can be computed in such a weak addressing model? This paper is a short visit that answers this question.
Off-chain blockchain contracts are useful for parties who wish to keep the details of their contract's business logic private. The ad hoc installation of blockchain instances on mutually agreed-upon servers forms a reliable infrastructure for such contracts. We detail the core post-quantum algorithmic ingredients for structuring ad hoc blockchain (AHB) infrastructure and contracts. The initialization of AHB involves agreement on servers and blockchain software, such as Ethereum or the post-quantum SodsBC versions [13]. These choices, in turn, are approved by the digital signatures of the parties bound by the contract.
A contract in AHB is executed by an MPC-based zero-knowledge proof of a hash (SHA) preimage commitment. Our efficient implementation extends zero-knowledge proofs based on MPC-in-the-Head (MitH, for short) to the case of distributed verified signatures, with global zero-knowledge verification of knowledge of a secret-shared preimage. AHB guarantees that the preimage is released when a group with a predefined number of participants running the AHB agrees that the off-chain contract conditions are met.
A new key aspect of AHB is the opportunity to eliminate all but the current state of the MPC history while executing the MPC, as we can verify that the result is identical to the declared SHA value of the preimage. Although AHB is a multi-round MPC, it transfers and stores less data than MitH. Furthermore, it maintains the same level of privacy, as participants are exposed solely to provably random information.
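To make the hash-preimage commitment at the heart of the AHB contract concrete, the following is a deliberately minimal commit/reveal sketch. It illustrates only the public commitment and its verification, not the MPC-based zero-knowledge machinery of MitH; the function names and the use of SHA-256 are illustrative assumptions, not the paper's protocol.

```python
import hashlib
import secrets

def commit(preimage: bytes) -> bytes:
    # The public commitment is the SHA-256 digest of the secret preimage.
    return hashlib.sha256(preimage).digest()

def verify_opening(commitment: bytes, claimed_preimage: bytes) -> bool:
    # Once the preimage is released, anyone can check it against the
    # commitment; no party learns the preimage before release.
    return hashlib.sha256(claimed_preimage).digest() == commitment

# A hypothetical contract secret and its commitment.
secret = secrets.token_bytes(32)
c = commit(secret)
assert verify_opening(c, secret)
assert not verify_opening(c, b"wrong preimage")
```

In the actual protocol, the verification of knowledge of the preimage happens in zero knowledge over secret shares, so the preimage is never revealed until the contract's release conditions are met.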
A modular version of the baskets queue of Hoffman, Shalev, and Shavit is presented. It manipulates the head and tail using a novel object called load-link/increment-conditional, which admits implementations based solely on READ/WRITE instructions, as well as versions that internally use CAS. These variants offer flexibility in balancing simplicity and performance, and admit implementations that spread contention. While other components of the queue still require stronger atomic primitives such as CAS or FAI, this separation highlights the potential for designs that mitigate the seemingly inherent bottleneck in previous queue implementations that manipulate the head and tail using read-modify-write (RMW) instructions over a single shared register.
We propose a modular queue algorithm that isolates head/tail coordination from other synchronization, enabling simpler and potentially more scalable composition of concurrent components. An experimental evaluation supports the theoretical results, showing that the proposed queue achieves performance comparable to that of existing state-of-the-art implementations. While no scalability improvements are observed, the modular approach offers conceptual clarity and flexibility, and may serve as a foundation for future concurrent queue designs.
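The load-link/increment-conditional interface can be sketched as follows. This is a hypothetical lock-based stand-in for the atomic step (the paper's point is precisely that READ/WRITE-only and CAS-based implementations exist); the class and method names are illustrative assumptions.

```python
import threading

class LLIC:
    """Load-link/increment-conditional over a single counter.

    load_link() returns the current value; increment_conditional(v)
    increments the counter only if it still holds the load-linked
    value v. The lock below merely emulates the required atomicity.
    """

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()  # stand-in for a hardware primitive

    def load_link(self) -> int:
        return self._value

    def increment_conditional(self, linked: int) -> bool:
        # Succeeds only if no increment intervened since the load-link.
        with self._lock:
            if self._value == linked:
                self._value += 1
                return True
            return False

llic = LLIC()
v = llic.load_link()
assert llic.increment_conditional(v)       # counter advances to v + 1
assert not llic.increment_conditional(v)   # the stale link now fails
assert llic.load_link() == v + 1
```

Confining head and tail manipulation to an object like this is what lets the queue swap in different implementations of the hot-spot operations without touching the rest of the algorithm.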
Byte-addressable non-volatile memory, provided by novel technologies such as Intel Optane and CXL, has presented opportunities for developing high-performance concurrent objects (including concurrent data structures) with the added benefits of durability and recoverability. However, developing such data structures while preserving high performance still remains challenging due to the latency and bandwidth gaps between volatile and non-volatile memory. In this paper, we investigate the construction of durably linearizable objects using our recently developed Mangosteen framework.
Mangosteen's frontend combines an efficient flat-combining-based concurrency control mechanism with dynamic binary instrumentation that captures updates to the application state (deduplicating store instructions for further efficiency). This frontend interfaces with an asynchronous persistency backend that maintains a redo log and a non-volatile application state to support recovery. Mangosteen is agnostic to the specifics of the underlying implementation and transforms any linearizable object (including a sequential object) into a durably linearizable counterpart supporting unlimited read-read concurrency. Moreover, the transformation is fully transparent, requiring almost no intervention from the end user. We demonstrate this via a sequential linked-list queue as well as both the lock-free and blocking versions of the Michael-Scott queue. We show that Mangosteen outperforms the state-of-the-art approach based on FliT under high concurrency.
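The flat-combining idea underlying Mangosteen's frontend can be sketched in miniature: threads publish their requests, and whichever thread acquires the combiner role applies all pending requests to a plain sequential object on their behalf. The sketch below is a simplified illustration of the pattern, not Mangosteen itself; the class, its locking scheme, and the request-slot fields are all illustrative assumptions.

```python
import threading
from collections import deque

class FlatCombiningQueue:
    """Miniature flat-combining wrapper around a sequential queue."""

    def __init__(self) -> None:
        self._queue = deque()            # the underlying sequential object
        self._requests = []              # publication list of pending ops
        self._pub_lock = threading.Lock()
        self._combiner_lock = threading.Lock()

    def _submit(self, op: str, arg=None):
        slot = {"op": op, "arg": arg,
                "done": threading.Event(), "result": None}
        with self._pub_lock:
            self._requests.append(slot)
        # Try to become the combiner; otherwise wait until some other
        # combiner has applied our request.
        while not slot["done"].is_set():
            if self._combiner_lock.acquire(timeout=0.01):
                try:
                    self._combine()
                finally:
                    self._combiner_lock.release()
        return slot["result"]

    def _combine(self) -> None:
        # The combiner drains the publication list and applies every
        # pending operation sequentially, with no per-operation locking.
        with self._pub_lock:
            pending, self._requests = self._requests, []
        for s in pending:
            if s["op"] == "enqueue":
                self._queue.append(s["arg"])
            else:  # dequeue
                s["result"] = self._queue.popleft() if self._queue else None
            s["done"].set()

    def enqueue(self, x) -> None:
        self._submit("enqueue", x)

    def dequeue(self):
        return self._submit("dequeue")

q = FlatCombiningQueue()
q.enqueue("a")
q.enqueue("b")
assert q.dequeue() == "a"
assert q.dequeue() == "b"
```

Because the combiner executes operations sequentially, the wrapped object never needs its own synchronization; in Mangosteen this is also the point where captured updates can be streamed to the persistency backend's redo log.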