Proud to be among the authors of the top 3 papers for 2020

  1. ViT (Google): cited by 11914 papers, 12 authors.
  2. GPT-3 (OpenAI): cited by 8070 papers, 31 authors.
  3. YOLOv4 (Academia Sinica): cited by 8014 papers, 3 authors: Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao (Google Scholar).
From some discussion on Twitter. Ranking based on Google Scholar: -the-100-most-cited-ai-papers-in-2022

Vision Transformers for Dense Prediction (ICCV 2021): State of the art accuracy on depth estimation and semantic segmentation (realtime >30 FPS)

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available.

Scaled YOLO v4 (CVPR2021): Absolute Top-1 neural network for object detection on MS COCO dataset

Scaled YOLO v4 (CVPR 2021) is the best neural network for object detection on the MS COCO dataset. It outperforms the following neural networks in accuracy:
  - Google EfficientDet D7x / DetectoRS or SpineNet-190 (self-trained on extra data)
  - Amazon Cascade-RCNN ResNest200
  - Microsoft RepPoints v2
  - Facebook RetinaNet SpineNet-190
  - And many others…
Scaled YOLOv4 is both more accurate and faster than:
  - Google EfficientDet D0-D7x
  - Google SpineNet S49s to S143
  - Baidu Paddle-Paddle PP YOLO
  - And many others…
Scaled YOLO v4 is a series of neural networks built on top of the improved and scaled YOLOv4 network. Our neural network was trained from scratch without using pre-trained weights (ImageNet or any other). The YOLOv4-tiny neural network reaches 1774 FPS on a gaming GPU, the RTX 2080Ti, when using TensorRT + tkDNN (batch = 4, FP16).

YOLOv4: Optimal Speed and Accuracy of Object Detection

There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset.
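Two of the features named above, Mish activation and CIoU loss, have short closed forms. The following is a minimal sketch of those published formulas in plain C++, not the Darknet implementation used for YOLOv4. The names `Box`, `softplus`, `mish` and `ciou_loss` are chosen here for illustration, and the small epsilon in the `alpha` term is an assumption added to avoid a 0/0 when the two boxes are identical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Numerically stable softplus: ln(1 + e^x).
inline double softplus(double x) {
    // For x > 40, ln(1 + e^x) equals x to double precision.
    return x > 40.0 ? x : std::log1p(std::exp(x));
}

// Mish activation: x * tanh(softplus(x)).
inline double mish(double x) {
    return x * std::tanh(softplus(x));
}

// Axis-aligned box given by its corners (illustrative type).
struct Box { double x1, y1, x2, y2; };

// CIoU loss: 1 - IoU + (center distance)^2 / (enclosing diagonal)^2 + alpha * v,
// where v measures the mismatch of the two aspect ratios.
inline double ciou_loss(const Box& a, const Box& b) {
    const double pi = std::acos(-1.0);
    const double wa = a.x2 - a.x1, ha = a.y2 - a.y1;
    const double wb = b.x2 - b.x1, hb = b.y2 - b.y1;
    assert(wa > 0 && ha > 0 && wb > 0 && hb > 0);

    // Intersection over union.
    const double iw = std::max(0.0, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    const double ih = std::max(0.0, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    const double inter = iw * ih;
    const double iou = inter / (wa * ha + wb * hb - inter);

    // Squared distance between the box centers.
    const double dx = ((a.x1 + a.x2) - (b.x1 + b.x2)) / 2.0;
    const double dy = ((a.y1 + a.y2) - (b.y1 + b.y2)) / 2.0;
    const double rho2 = dx * dx + dy * dy;

    // Squared diagonal of the smallest box enclosing both.
    const double cw = std::max(a.x2, b.x2) - std::min(a.x1, b.x1);
    const double ch = std::max(a.y2, b.y2) - std::min(a.y1, b.y1);
    const double c2 = cw * cw + ch * ch;

    // Aspect-ratio consistency term and its trade-off weight.
    const double d = std::atan(wb / hb) - std::atan(wa / ha);
    const double v = 4.0 / (pi * pi) * d * d;
    const double alpha = v / ((1.0 - iou) + v + 1e-12);  // epsilon is an assumption

    return 1.0 - iou + rho2 / c2 + alpha * v;
}
```

For two identical boxes the loss is 0. For two unit squares two units apart on the x axis, IoU is 0 and the distance penalty is 4/10, so the loss is 1.4.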

C++: Thread-safe std::map with the speed of lock-free map

Introduction: Examples of use and testing of a thread-safe pointer and a contention-free shared mutex. In this article we show additional optimizations, usage examples, and tests of the thread-safe pointer we developed with an optimized shared mutex, contfree_safe_ptr<T>, which is equivalent to safe_ptr<T, contention_free_shared_mutex<>>. At the end, we show comparative graphs of our thread-safe pointer tests against some of the best lock-free algorithms from libCDS on Intel Core i5 / i7, Xeon, and 2 x Xeon processors. All benchmarks in this article use the same commit of libCDS. Three related articles:
  1. We make any object thread-safe
  2. We make a std::shared_mutex 10 times faster

C++: We make a std::shared_mutex 10 times faster

Introduction: High performance of lock-based data structures. In this article, we detail atomic operations and C++11 memory barriers, and the assembler instructions they generate on x86_64 CPUs. Next, we show how to speed up contfree_safe_ptr<std::map> to the level of complex, optimized lock-free data structures that are similar to std::map<> in functionality, for example SkipListMap and BronsonAVLTreeMap from the libCDS library (Concurrent Data Structures library). You can get the same multi-threaded performance for any initially non-thread-safe class T used as contfree_safe_ptr<T>, which is the class safe_ptr<T, contention_free_shared_mutex>, where contention_free_shared_mutex is our own optimized shared mutex. Namely, we show how to implement your own high-performance contention-free shared mutex, which almost never conflicts on reads.
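For context, the baseline that such an optimized mutex is measured against can be sketched as a std::map guarded by the standard std::shared_mutex (C++17). This is not the article's contention_free_shared_mutex. The wrapper name `shared_map` and the function `demo_shared_map` are hypothetical, used only to show readers taking a shared lock and writers taking an exclusive lock.

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>
#include <thread>
#include <vector>

// Illustrative wrapper (hypothetical name): a std::map behind a std::shared_mutex.
class shared_map {
    mutable std::shared_mutex mtx_;
    std::map<std::string, int> data_;

public:
    // Writers take the lock exclusively.
    void set(const std::string& key, int value) {
        std::unique_lock<std::shared_mutex> lock(mtx_);
        data_[key] = value;
    }

    // Readers take the lock in shared mode, so they can run concurrently.
    std::optional<int> get(const std::string& key) const {
        std::shared_lock<std::shared_mutex> lock(mtx_);
        auto it = data_.find(key);
        if (it == data_.end()) return std::nullopt;
        return it->second;
    }
};

// Four readers look up one key while a writer inserts unrelated keys.
int demo_shared_map() {
    shared_map m;
    m.set("answer", 42);
    std::atomic<int> hits{0};
    std::vector<std::thread> threads;

    threads.emplace_back([&m] {
        for (int i = 0; i < 100; ++i) m.set("k" + std::to_string(i), i);
    });
    for (int t = 0; t < 4; ++t) {
        threads.emplace_back([&m, &hits] {
            for (int i = 0; i < 1000; ++i)
                if (m.get("answer").value_or(0) == 42) ++hits;
        });
    }
    for (auto& th : threads) th.join();

    assert(m.get("k99").value_or(-1) == 99);  // the writer finished all inserts
    return hits.load();                       // every read saw the value 42
}
```

Even in shared mode, every reader here modifies the one shared lock state, so read-mostly workloads can still contend on it. That cost is what a contention-free design tries to avoid.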

C++: We make any object thread-safe

Introduction: A smart pointer that makes any object thread-safe for any operation, with performance equal to that of optimized lock-free containers. In these 3 articles I'll describe in detail atomic operations, memory barriers, and the rapid exchange of data between threads, as well as "sequence points", using the "execute-around idiom" as an example. At the same time, we will try to do something useful together. The standard C++ library has no thread-safe containers (array, list, map ...) that could be used from multiple threads without additional locks. If you use standard containers to exchange data between threads, there is a danger that you forget to protect one of the code sections with a mutex, or protect it with the wrong mutex by mistake. Obviously, a developer will make more mistakes with his/her own solution than with a standard one. And if the task is so complex that there is no standard solution, then the developer, while…
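The execute-around idiom mentioned above can be sketched in a few lines. This is a simplified illustration, not the safe_ptr<T> from these articles. The names `locked_ptr`, `proxy` and `demo_locked_ptr` are hypothetical, and a plain std::mutex is assumed. The idea is that operator-> returns a temporary proxy that locks in its constructor and unlocks in its destructor, so every call made through the pointer runs under the lock.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Illustrative execute-around pointer (hypothetical name).
template <typename T>
class locked_ptr {
    // Copies of a locked_ptr share one object and one mutex.
    std::shared_ptr<T> obj_ = std::make_shared<T>();
    std::shared_ptr<std::mutex> mtx_ = std::make_shared<std::mutex>();

    // Temporary proxy: acquires the lock on construction and releases it
    // when the temporary is destroyed at the end of the full expression.
    class proxy {
        std::unique_lock<std::mutex> lock_;
        T* ptr_;
    public:
        proxy(T* ptr, std::mutex& mtx) : lock_(mtx), ptr_(ptr) {}
        T* operator->() { return ptr_; }
    };

public:
    // The compiler chains operator-> calls: locked_ptr -> proxy -> T*.
    proxy operator->() { return proxy(obj_.get(), *mtx_); }
};

// Four threads append to one std::vector without any explicit locking.
std::size_t demo_locked_ptr() {
    locked_ptr<std::vector<int>> v;
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) {
        threads.emplace_back([&v] {
            for (int i = 0; i < 1000; ++i) v->push_back(i);
        });
    }
    for (auto& th : threads) th.join();
    return v->size();
}
```

Each `->` expression takes the lock separately. A sequence of several calls, such as a lookup followed by an insert, is therefore not atomic as a whole and needs a lock held across the whole sequence.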

The fastest interconnect for hundreds of CPUs, GPUs and FPGAs (make a supercomputer)

About the fastest inter-server and inter-device communication, with bandwidth above 100 Gbit/sec and latency of 0.5 usec