This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

An illegal memory access was encountered #45

Open
PkuRainBow opened this issue Oct 27, 2018 · 10 comments
Labels
question Further information is requested

Comments

@PkuRainBow

PkuRainBow commented Oct 27, 2018

🐛 Bug

I ran the script below on 4 x P100 GPUs.

PYTHON="/root/miniconda3/bin/python"
CONFIG="./configs/e2e_mask_rcnn_R_50_FPN_1x.yaml"

export NGPUS=4
${PYTHON} -m torch.distributed.launch --nproc_per_node=$NGPUS \
	./tools/train_net.py --config-file $CONFIG

Expected behavior

Here is the error information:
[image: error output]

It seems that the first few iterations are OK (iter: 0, 20).

Then, at iter 40, the number in the brackets becomes nan, and I get an error telling me that an illegal memory access was encountered.

Environment

I installed the environment by following the instructions.

  • PyTorch Version 1.0
  • Linux 16.04
  • Python version: 3.6
  • CUDA/cuDNN version: 8.0
  • GPU models and configuration: 4 X P100
@fmassa
Contributor

fmassa commented Oct 27, 2018

Could you give more information?

I suspect it happens because your learning rate was too high and training diverged, producing large indices.

@PkuRainBow
Author

PkuRainBow commented Oct 27, 2018

Could you give more information?

I suspect it happens because your learning rate was too high and training diverged, producing large indices.

@fmassa, thanks for your quick reply.
Here is the default YAML file:

MODEL:
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHT: "catalog://ImageNetPretrained/MSRA/R-50"
  BACKBONE:
    CONV_BODY: "R-50-FPN"
    OUT_CHANNELS: 256
  RPN:
    USE_FPN: True
    ANCHOR_STRIDE: (4, 8, 16, 32, 64)
    PRE_NMS_TOP_N_TRAIN: 2000
    PRE_NMS_TOP_N_TEST: 1000
    POST_NMS_TOP_N_TEST: 1000
    FPN_POST_NMS_TOP_N_TEST: 1000
  ROI_HEADS:
    USE_FPN: True
  ROI_BOX_HEAD:
    POOLER_RESOLUTION: 7
    POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125)
    POOLER_SAMPLING_RATIO: 2
    FEATURE_EXTRACTOR: "FPN2MLPFeatureExtractor"
    PREDICTOR: "FPNPredictor"
  ROI_MASK_HEAD:
    POOLER_SCALES: (0.25, 0.125, 0.0625, 0.03125)
    FEATURE_EXTRACTOR: "MaskRCNNFPNFeatureExtractor"
    PREDICTOR: "MaskRCNNC4Predictor"
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 2
    RESOLUTION: 28
    SHARE_BOX_FEATURE_EXTRACTOR: False
  MASK_ON: True
DATASETS:
  TRAIN: ("coco_2014_train", "coco_2014_valminusminival")
  TEST: ("coco_2014_minival",)
DATALOADER:
  SIZE_DIVISIBILITY: 32
SOLVER:
  BASE_LR: 0.02
  # BASE_LR: 0.0025
  WEIGHT_DECAY: 0.0001
  STEPS: (60000, 80000)
  MAX_ITER: 90000
  # IMS_PER_BATCH: 2


@fmassa
Contributor

fmassa commented Oct 27, 2018

So, you have changed the IMS_PER_BATCH to be 2, and the learning rate as well?

@fmassa
Contributor

fmassa commented Oct 27, 2018

Try following the learning rate adaptation rules that I mentioned in the README; they are necessary to keep training from diverging.
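As a rough illustration of what such an adaptation rule looks like, here is a minimal sketch of linear scaling: when the global batch shrinks, the learning rate shrinks by the same factor and the iteration counts grow by it. `BASE_LR`, `STEPS` and `MAX_ITER` are taken from the YAML posted above; the reference batch of 16 images and the names `SolverSchedule` and `rescale_schedule` are assumptions made for this example, not part of the library.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SolverSchedule:
    ims_per_batch: int       # global batch size (all GPUs combined)
    base_lr: float
    steps: Tuple[int, ...]   # iterations at which the LR is decayed
    max_iter: int


# BASE_LR, STEPS and MAX_ITER are copied from the YAML posted above.
# ASSUMPTION: that schedule was tuned for a global batch of 16 images.
REFERENCE = SolverSchedule(ims_per_batch=16, base_lr=0.02,
                           steps=(60000, 80000), max_iter=90000)


def rescale_schedule(ref: SolverSchedule, ims_per_batch: int) -> SolverSchedule:
    """Linear scaling: the LR shrinks with the batch, iteration counts grow."""
    if ims_per_batch <= 0:
        raise ValueError("ims_per_batch must be positive")
    ratio = ims_per_batch / ref.ims_per_batch
    return SolverSchedule(
        ims_per_batch=ims_per_batch,
        base_lr=ref.base_lr * ratio,
        steps=tuple(round(s / ratio) for s in ref.steps),
        max_iter=round(ref.max_iter / ratio),
    )


if __name__ == "__main__":
    for batch in (16, 8, 2):
        print(rescale_schedule(REFERENCE, batch))
```

Under these assumptions, a global batch of 8 gives a learning rate of 0.01, and a global batch of 2 gives 0.0025 with an eight times longer schedule. Keeping `BASE_LR: 0.02` with a much smaller batch is the kind of mismatch that can make training diverge.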

@PkuRainBow
Author

@fmassa I still cannot figure out the problem.

@fmassa
Contributor

fmassa commented Oct 29, 2018

So, to double check:

  • you are using 4 GPUs
  • you set IMS_PER_BATCH to 2

Is that right?

Note that IMS_PER_BATCH means something different in maskrcnn-benchmark than it does in Detectron.
If you use fewer than 8 GPUs, you might need to change a few hyperparameters for training to behave the same.
Have a look at https://github.com/facebookresearch/maskrcnn-benchmark#single-gpu-training for the differences and what to do.

Let me know if you still have problems
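To make the difference concrete, here is a small sketch of the two readings of the same number. It assumes, as I understand the README, that maskrcnn-benchmark treats IMS_PER_BATCH as the total across all GPUs while Detectron treats it as a per-GPU count; the function names are hypothetical and exist only for this example.

```python
def images_per_gpu(ims_per_batch: int, num_gpus: int) -> int:
    """Global-batch reading (assumed for maskrcnn-benchmark):
    the configured value is split evenly across the GPUs."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if ims_per_batch % num_gpus != 0:
        raise ValueError(
            f"IMS_PER_BATCH={ims_per_batch} is not divisible by {num_gpus} GPUs"
        )
    return ims_per_batch // num_gpus


def global_batch_from_per_gpu(ims_per_gpu: int, num_gpus: int) -> int:
    """Per-GPU reading (assumed for Detectron):
    the configured value is multiplied by the number of GPUs."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return ims_per_gpu * num_gpus


if __name__ == "__main__":
    # 8 GPUs with a global batch of 16 -> 2 images on each GPU.
    print(images_per_gpu(16, 8))
    # 4 GPUs with 2 images each -> a global batch of 8.
    print(global_batch_from_per_gpu(2, 4))
```

With 4 GPUs, a value of 2 would mean a global batch of 8 under the per-GPU reading, but cannot even be split evenly under the global reading, so the same config line can describe very different training setups.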

fmassa added the question label Oct 29, 2018
@PkuRainBow
Author

@fmassa Thanks for your kind help.

I will post an update if I make progress.

@zimenglan-sysu-512
Contributor

zimenglan-sysu-512 commented Dec 4, 2018

Hi @fmassa,
after several thousand or several tens of thousands of iterations, the loss becomes NaN:

2018-12-04 07:02:12,736 maskrcnn_benchmark.trainer INFO: eta: 2 days, 5:11:04  iter: 38300  loss: 0.4934 (0.6051)  loss_classifier: 0.2030 (0.2690)  loss_box_reg: 0.1892 (0.2426)  loss_objectness: 0.0369 (0.0527)  loss_rpn_box_reg: 0.0336 (0.0409)  time: 1.0707 (1.0797)  data: 0.0126 (0.0124)  lr: 0.010000  max mem: 3778
2018-12-04 07:02:34,353 maskrcnn_benchmark.trainer INFO: eta: 2 days, 5:10:43  iter: 38320  loss: 0.5649 (0.6050)  loss_classifier: 0.2554 (0.2689)  loss_box_reg: 0.2274 (0.2426)  loss_objectness: 0.0426 (0.0527)  loss_rpn_box_reg: 0.0374 (0.0409)  time: 1.0791 (1.0797)  data: 0.0115 (0.0124)  lr: 0.010000  max mem: 3778
2018-12-04 07:02:54,637 maskrcnn_benchmark.trainer INFO: eta: 2 days, 5:10:15  iter: 38340  loss: nan (nan)  loss_classifier: 0.2202 (nan)  loss_box_reg: nan (nan)  loss_objectness: nan (nan)  loss_rpn_box_reg: nan (nan)  time: 1.0134 (1.0797)  data: 0.0101 (0.0124)  lr: 0.010000  max mem: 3778
2018-12-04 07:03:13,254 maskrcnn_benchmark.trainer INFO: eta: 2 days, 5:09:39  iter: 38360  loss: nan (nan)  loss_classifier: nan (nan)  loss_box_reg: nan (nan)  loss_objectness: nan (nan)  loss_rpn_box_reg: nan (nan)  time: 0.9273 (1.0796)  data: 0.0099 (0.0124)  lr: 0.010000  max mem: 3778
2018-12-04 07:03:31,830 maskrcnn_benchmark.trainer INFO: eta: 2 days, 5:09:04  iter: 38380  loss: nan (nan)  loss_classifier: nan (nan)  loss_box_reg: nan (nan)  loss_objectness: nan (nan)  loss_rpn_box_reg: nan (nan)  time: 0.9140 (1.0795)  data: 0.0100 (0.0124)  lr: 0.010000  max mem: 3778

Do you have any ideas on how to solve it?

@fmassa
Contributor

fmassa commented Dec 4, 2018

@zimenglan-sysu-512 difficult to say without more context. Is this COCO? Are you using a standard model or have you adapted one of the models? It might require some digging to understand where the problem might come from.
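One way to start that digging is to stop at the first iteration where any loss term is no longer finite, so the offending batch and loss head can be inspected before every value turns into NaN. The sketch below is only an illustration: `non_finite_losses` and `check_losses` are hypothetical helpers, and the losses are assumed to be plain Python floats (with PyTorch tensors, each value would first be converted with `.item()`).

```python
import math
from typing import Dict, List


def non_finite_losses(losses: Dict[str, float]) -> List[str]:
    """Return the names of loss terms that are NaN or +/-inf."""
    return [name for name, value in losses.items() if not math.isfinite(value)]


def check_losses(iteration: int, losses: Dict[str, float]) -> None:
    """Raise as soon as any loss term stops being finite."""
    bad = non_finite_losses(losses)
    if bad:
        raise FloatingPointError(
            f"iter {iteration}: non-finite loss terms: {', '.join(bad)}"
        )


if __name__ == "__main__":
    # Values taken from the iter 38340 line of the log above.
    losses = {
        "loss_classifier": 0.2202,
        "loss_box_reg": float("nan"),
        "loss_objectness": float("nan"),
        "loss_rpn_box_reg": float("nan"),
    }
    print(non_finite_losses(losses))
```

Calling `check_losses` once per iteration, rather than reading the 20-iteration summaries in the log, narrows down which term diverges first and on which batch.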

@zimenglan-sysu-512
Contributor

zimenglan-sysu-512 commented Dec 4, 2018

Hi @fmassa,
I am adding Light-Head R-CNN to train R-50-C4 on the COCO dataset, so there may be something wrong in my implementation. I need to check my code.
Thanks.
