Kaiming He's MAE limitation broken: combined with Swin Transformer, training speed improved
Since Kaiming He's MAE burst onto the scene, MIM (Masked Image Modeling) has drawn ever more attention as a self-supervised pre-training representation. At the same time, researchers have had to confront its limitations: the MAE paper only tried the vanilla ViT architecture as the encoder, and the better-performing hierarchical designs, represented by Swin Transformer, cannot use the MAE method directly. A wave of integration efforts has therefore played out across research teams.

One representative work is SimMIM, proposed by Tsinghua, Microsoft Research Asia, and Xi'an Jiaotong University, which explored applying Swin Transformer to MIM. Compared with MAE, however, it operates on both the visible and the masked patches, and its computation cost is excessive: researchers found that even SimMIM's base-size model cannot complete training on a machine with eight 32GB GPUs.

Against this background, researchers from the University of Tokyo, SenseTime, and the University of Sydney offer a new idea.

Green Hierarchical Vision Transformer for Masked Image Modeling
Lang Huang¹, Shan You², Mingkai Zheng³, Fei Wang², Chen Qian², Toshihiko Yamasaki¹
¹The University of Tokyo; ²SenseTime Research; ³The University of Sydney

Their work not only integrates Swin Transformer into the MAE framework with task performance on par with SimMIM, but also preserves computational efficiency, speeding up the training of hierarchical ViTs by 2.7× and cutting GPU memory usage by 70%.

Let's take a look at what this research is about.

When hierarchical design meets MAE

The paper proposes a green hierarchical vision Transformer for MIM, one that allows a hierarchical ViT to discard the masked patches and operate only on the visible ones.

(Figure: Method overview. Encoder: Green Hierarchical ViT with Group Window Attention, Stages 1-4; Decoder: an isotropic ViT trained with MSE reconstruction losses.)

The concrete implementation consists of two key parts.

First, a group window attention scheme based on a divide-and-conquer strategy: local windows containing different numbers of visible patches are gathered into several equal-size groups, and masked self-attention is then performed within each group (a PyTorch sketch of this scheme follows the grouping code below).

(Figure: Group attention scheme with masked attention.)

Second, the grouping task above is treated as a constrained dynamic-programming problem, and a grouping algorithm inspired by the greedy algorithm is proposed:

Algorithm 1: Optimal Grouping
Require: the number of visible patches within each local window, {w_i}
 1: minimum computational cost c* ← +∞
 2: for g = max_i w_i to Σ_i w_i do
 3:     remaining windows Φ ← {w_i}; partition Π ← ∅; number of groups n_g ← 0
 4:     repeat
 5:         π ← Knapsack(g, Φ), as in Equation (7)
 6:         Π ← Π ∪ {π}; Φ ← Φ \ π
 7:         n_g ← n_g + 1
 8:     until Φ = ∅
 9:     c ← C(g, Π), as in Equation (8)
10:     if c < c* then
11:         c* ← c; Π* ← Π
12:     end if
13: end for
14: return the optimal group partition Π*

It adaptively selects the best group size and partitions the local windows into the fewest possible groups, so that the overall computational cost of attention over the grouped patches is minimized.
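To make Algorithm 1 concrete, here is a minimal Python sketch of the grouping procedure. Equations (7) and (8) are not reproduced in this article, so the sketch assumes that the Knapsack step packs a group of capacity g as fully as possible and that the cost C(g, Π) is proportional to n_g · g² (self-attention within a group being quadratic in the group size); both are illustrative stand-ins, as are the function names.

    def knapsack(capacity, windows):
        """Pick a subset of (index, size) pairs whose sizes sum to at most
        `capacity`, maximizing the total size (a stand-in for Equation (7))."""
        dp = {0: []}  # achievable total size -> chosen window indices
        for idx, size in windows:
            # iterate over a snapshot so each window is used at most once
            for total, chosen in sorted(dp.items(), reverse=True):
                new_total = total + size
                if new_total <= capacity and new_total not in dp:
                    dp[new_total] = chosen + [idx]
        return dp[max(dp)]  # the fullest achievable packing

    def optimal_grouping(w):
        """Algorithm 1: for every candidate group size g, greedily pack the
        windows into groups of capacity g with the knapsack, then keep the
        partition with the lowest assumed cost n_g * g**2 (a stand-in for
        Equation (8)). `w` holds the positive visible-patch count of each
        local window; fully masked windows are assumed dropped beforehand."""
        best_cost, best_g, best_partition = float("inf"), None, None
        for g in range(max(w), sum(w) + 1):
            remaining = dict(enumerate(w))  # window index -> #visible patches
            partition = []
            while remaining:  # "repeat ... until Phi is empty"
                group = knapsack(g, list(remaining.items()))
                partition.append(group)
                for i in group:
                    del remaining[i]
            cost = len(partition) * g * g  # n_g groups, each of size g
            if cost < best_cost:
                best_cost, best_g, best_partition = cost, g, partition
        return best_g, best_partition

For instance, optimal_grouping([13, 7, 9, 4, 4, 12]) returns a group size g and a partition of the six windows into groups whose visible-patch counts each total at most g.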
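The masked self-attention inside one such group can then be sketched as follows. Tokens gathered into a group come from several different local windows, and an additive mask restricts each token to attend only within its own window, so packing windows together changes nothing numerically and merely batches the computation. This is again an illustrative sketch: the q/k/v and output projections of a real attention block are omitted for brevity.

    import torch

    def masked_group_attention(x, win_id, num_heads=4):
        """Self-attention over one group of gathered visible tokens.

        x      : (g, C) visible tokens packed into one equal-size group,
                 where g is the group size chosen by the grouping algorithm.
        win_id : (g,)   index of the local window each token came from.
        """
        g, C = x.shape
        h = num_heads
        # identity q/k/v for brevity; a real block would project x first
        q = k = v = x.view(g, h, C // h).transpose(0, 1)    # (h, g, C/h)
        attn = (q @ k.transpose(-2, -1)) / (C // h) ** 0.5  # (h, g, g)
        # block attention between tokens from different windows
        same_window = win_id[:, None] == win_id[None, :]    # (g, g) bool
        attn = attn.masked_fill(~same_window, float("-inf"))
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(0, 1).reshape(g, C)      # (g, C)
        return out

    # illustrative usage: one group of 16 tokens gathered from two windows
    x = torch.randn(16, 32)
    win_id = torch.tensor([0] * 7 + [1] * 9)
    out = masked_group_attention(x, win_id, num_heads=4)    # shape (16, 32)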
Comparable performance, much less training time

Experimental evaluation on the ImageNet-1K and MS-COCO datasets shows that, while matching the performance of the SimMIM baseline, the method is more than 2× as efficient.

Table 2: Top-1 accuracy on the ImageNet-1K validation set with the Swin-B or ViT-B models. All methods are trained with images of size 224×224 in both pre-training and fine-tuning, except for SimMIM_192, which uses 192×192 in pre-training.

    Method                  | Model  | #Params | PT Ep. | Hours/Ep. | Total Hours | FT Ep. | Acc. (%)
    Training from scratch
    Scratch, DeiT [56]      | ViT-B  | 86M     | 0      | –         | –           | 300    | 81.8
    Scratch, MAE [22]       | ViT-B  | 86M     | 0      | –         | –           | 300    | 82.3
    Scratch, Swin [43]      | Swin-B | 88M     | 0      | –         | –           | 300    | 83.5
    Supervised pre-training
    Supervised, SimMIM [66] | Swin-B | 88M     | 300    | –         | –           | 100    | 83.3
    Supervised, SimMIM [66] | Swin-L | 197M    | 300    | –         | –           | 100    | 83.5
    Pre-training with contrastive learning
    MoCo v3 [11]            | ViT-B  | 86M     | 800    | –         | –           | 100    | 83.2
    DINO [9]                | ViT-B  | 86M     | 800    | –         | –           | 100    | 82.8
    Pre-training with masked image modeling
    BEiT [2]                | ViT-B  | 86M     | 800    | –         | –           | 100    | 83.2
    MaskFeat [62]           | ViT-B  | 86M     | 800    | –         | –           | 100    | 84.0
    MAE [22]                | ViT-B  | 86M     | 1600   | 1.3       | 2069        | 100    | 83.6
    SimMIM_224 [66]         | ViT-B  | 86M     | 800    | 4.1       | 3307        | 100    | 83.8
    SimMIM_192 [66]         | Swin-B | 88M     | 800    | 2.0       | 1609        | 100    | 84.0
    SimMIM_192 [66]         | Swin-L | 197M    | 800    | 3.5       | 2821        | 100    | 85.4
    Ours                    | Swin-B | 88M     | 800    | 1.1       | 887         | 100    | 83.7
    Ours                    | Swin-L | 197M    | 800    | 1.3       | 1067        | 100    | 85.4

Table 3: MS-COCO object detection and instance segmentation. All methods are based on the Mask R-CNN [24] architecture with the FPN [40] neck. The methods in gray are cited from [39]; most of them use much longer training schedules and advanced data augmentations.

    Method            | Backbone | PT Ep. | PT Hours | FT Ep. | AP^b | AP^b_50 | AP^b_75 | AP^m | AP^m_50 | AP^m_75
    Training from scratch
    Benchmarking [39] | ViT-B    | 0      | 0        | 400    | 48.9 | –       | –       | 43.6 | –       | –
    Supervised pre-training
    Benchmarking [39] | ViT-B    | 300    | 992      | 100    | 47.9 | –       | –       | 42.9 | –       | –
    PVT [60]          | PVT-L    | 300    | …        | 36     | 44.5 | 66.0    | 48.3    | 40.7 | 63.4    | 43.7
    Swin [43]         | Swin-B   | 300    | 840      | 36     | 48.5 | 69.8    | 53.2    | 43.2 | 66.9    | 46.7
    Self-supervised pre-training
    MoCo v3 [11]      | ViT-B    | 800    | …        | 100    | …    | …       | …       | …    | …       | …
    BEiT [2]          | ViT-B    | 800    | …        | 100    | …    | …       | …       | …    | …       | …
    MAE [22]          | ViT-B    | 1600   | 2069     | 25     | …    | …       | …       | …    | …       | …
    MAE [22]          | ViT-B    | 1600   | 2069     | 100    | …    | …       | …       | …    | …       | …
    SimMIM [66]       | Swin-B   | 800    | 1609     | 36     | …    | …       | …       | …    | …       | …
    Ours              | Swin-B   | 800    | 887      | 36     | …    | …       | …       | …    | …       | …

Compared with SimMIM, this method requires far less training time and consumes much less GPU memory. Specifically, at the same number of training epochs, it is 2× faster and uses 60% less memory on Swin-B.

(Figure 1: Comparison with SimMIM in terms of efficiency, measured in GPU hours per epoch and GPU memory in GB, for Ours (224), SimMIM (192), and SimMIM (224). All methods use a Swin-B/Swin-L backbone and a batch size of 2,048; the experiments of our method are conducted on a single machine.)

(Figure 4: The optimal group size at each stage, one panel per stage. The figure for the fourth stage is omitted because there is only one local window in that stage, so grouping is not necessary. The simulation is repeated 100 times, of which the mean and standard deviation (the shaded regions) are reported.)