What is causing the GPU out-of-memory (OOM) error for my Sequence-to-Sequence network with LSTM?

by IronEdward   Last Updated January 14, 2018 09:19 AM

I'm currently trying to build a Seq2Seq chatbot with LSTMs, using data from the Cornell Movie-Dialogs Corpus.

Here's the link to my code on GitHub; I'd appreciate it if you took a look: Seq2Seq Chatbot. (You'll need to change the file path for it to run correctly.)
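
For reference, the relevant part of the model is roughly the following. This is a simplified sketch reconstructed from the pieces visible in the traceback (decoder_dense, the softmax activation, RMSprop, and the three arrays passed to fit()); latent_dim and the exact layer arrangement are placeholders, and the real code is in the repo.

# Simplified sketch of the encoder-decoder, not the actual script.
# latent_dim is a placeholder; 44592 is the one-hot vocabulary size
# taken from the tensor shape in the error below.
from keras.models import Model
from keras.layers import Input, LSTM, Dense

latent_dim = 256
num_tokens = 44592

encoder_inputs = Input(shape=(None, num_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

decoder_inputs = Input(shape=(None, num_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=[state_h, state_c])

decoder_dense = Dense(num_tokens, activation='softmax')   # 'dense_1' in the trace
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')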

I'm using two GTX 1080s (8 GB of VRAM each), and I'm training my code with GPU support.
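
One thing worth noting: only GPU:0 appears in the traceback below, and as far as I know plain Keras places the whole graph on a single GPU unless the model is wrapped with multi_gpu_model, so effectively one 8 GB card is in play. To observe the allocation behaviour, the session can be set up with on-demand memory growth; a minimal sketch for the TF 1.x backend shown in the traceback (this only changes when memory is grabbed, not how much is needed):

# Let TensorFlow grow GPU memory on demand instead of reserving
# nearly all of it at start-up (TF 1.x + standalone Keras).
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))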

Here's the error I got:

.
.
.
2018-01-14 17:09:40.102649: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 1002348032 totalling 955.91MiB
2018-01-14 17:09:40.102656: I tensorflow/core/common_runtime/bfc_allocator.cc:683] Sum Total of in-use chunks: 7.30GiB
2018-01-14 17:09:40.102665: I tensorflow/core/common_runtime/bfc_allocator.cc:685] Stats:
Limit:                  7968181453
InUse:                  7836243200
MaxInUse:               7836262144
NumAllocs:                   48210
MaxAllocSize:           1002348032

2018-01-14 17:09:40.103459: W tensorflow/core/common_runtime/bfc_allocator.cc:277] **********************_******************************xx*****************************************xxxx
2018-01-14 17:09:40.103484: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[3,1143,44592]
Traceback (most recent call last):
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3,1143,44592]
         [[Node: training/RMSprop/gradients/dense_1/Max_grad/Cast = Cast[DstT=DT_FLOAT, SrcT=DT_BOOL, _class=["loc:@dense_1/Max"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/RMSprop/gradients/dense_1/Max_grad/Equal)]]
         [[Node: loss/mul/_93 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3108_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/edward/デスクトップ/Chatbot/Seq2Seq Model/main (One-hot Batch).py", line 100, in <module>
    model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=3, epochs=epochs, validation_split=0.)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/training.py", line 1657, in fit
    validation_steps=validation_steps)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/training.py", line 1213, in _fit_loop
    outs = f(ins_batch)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/backend/tensorflow_backend.py", line 2357, in __call__
    **self.session_kwargs)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3,1143,44592]
         [[Node: training/RMSprop/gradients/dense_1/Max_grad/Cast = Cast[DstT=DT_FLOAT, SrcT=DT_BOOL, _class=["loc:@dense_1/Max"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/RMSprop/gradients/dense_1/Max_grad/Equal)]]
         [[Node: loss/mul/_93 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3108_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'training/RMSprop/gradients/dense_1/Max_grad/Cast', defined at:
  File "/home/edward/デスクトップ/Chatbot/Seq2Seq Model/main (One-hot Batch).py", line 100, in <module>
    model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=3, epochs=epochs, validation_split=0.)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/training.py", line 1634, in fit
    self._make_train_function()
  File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/training.py", line 990, in _make_train_function
    loss=self.total_loss)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/optimizers.py", line 225, in get_updates
    grads = self.get_gradients(loss, params)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/optimizers.py", line 73, in get_gradients
    grads = K.gradients(loss, params)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/backend/tensorflow_backend.py", line 2394, in gradients
    return tf.gradients(loss, variables, colocate_gradients_with_ops=True)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py", line 581, in gradients
    grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py", line 353, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gradients_impl.py", line 581, in <lambda>
    grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/math_grad.py", line 87, in _MaxGrad
    return _MinOrMaxGrad(op, grad)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/math_grad.py", line 77, in _MinOrMaxGrad
    indicators = math_ops.cast(math_ops.equal(y, op.inputs[0]), grad.dtype)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/math_ops.py", line 745, in cast
    return gen_math_ops.cast(x, base_type, name=name)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gen_math_ops.py", line 892, in cast
    "Cast", x=x, DstT=DstT, name=name)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

...which was originally created as op 'dense_1/Max', defined at:
  File "/home/edward/デスクトップ/Chatbot/Seq2Seq Model/main (One-hot Batch).py", line 73, in <module>
    decoder_outputs = decoder_dense(decoder_outputs)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/engine/topology.py", line 603, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/layers/core.py", line 847, in call
    output = self.activation(output)
  File "/home/edward/.local/lib/python3.4/site-packages/keras/activations.py", line 26, in softmax
    e = K.exp(x - K.max(x, axis=axis, keepdims=True))
  File "/home/edward/.local/lib/python3.4/site-packages/keras/backend/tensorflow_backend.py", line 1213, in max
    return tf.reduce_max(x, axis=axis, keep_dims=keepdims)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/math_ops.py", line 1525, in reduce_max
    name=name)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2485, in _max
    keep_dims=keep_dims, name=name)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3,1143,44592]
         [[Node: training/RMSprop/gradients/dense_1/Max_grad/Cast = Cast[DstT=DT_FLOAT, SrcT=DT_BOOL, _class=["loc:@dense_1/Max"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/RMSprop/gradients/dense_1/Max_grad/Equal)]]
         [[Node: loss/mul/_93 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3108_loss/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Exception ignored in: <bound method Session.__del__ of <tensorflow.python.client.session.Session object at 0x7f18568a7e10>>
Traceback (most recent call last):
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 696, in __del__
  File "/home/edward/.local/lib/python3.4/site-packages/tensorflow/python/framework/c_api_util.py", line 30, in __init__
TypeError: 'NoneType' object is not callable

It's telling me that I ran out of memory.

The funny thing is that I'm currently splitting my input sentences into batches of 20 (I later changed this value; see below) and training on them, and I can see the program carving up a huge number of memory chunks for just this one batch of data (i.e., lines like 2018-01-14 17:09:40.102649: I tensorflow/core/common_runtime/bfc_allocator.cc:679] 1 Chunks of size 1002348032 totalling 955.91MiB are printed to the console a ridiculous number of times). It never gets to the next batch.
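
For reference, converting the allocator stats above from bytes shows the card is essentially full before the failing allocation (a quick sanity check; values copied straight from the log):

# Values copied from the bfc_allocator stats above (in bytes).
limit  = 7968181453    # allocator limit on the GTX 1080
in_use = 7836243200    # already allocated when the OOM happens

print(limit / 2**30)             # ~7.42 GiB usable
print(in_use / 2**30)            # ~7.30 GiB in use
print((limit - in_use) / 2**20)  # ~126 MiB of headroom left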

Following similar issues I've seen online, I've tried:

  • Reducing memory usage by changing the data type from float32 to float16 (see the sketch after this list)
  • Reducing the batch size to 10, then 5, then 3
  • Reducing the number of epochs to 3

and none of these worked. Most of the similar issues I found involved image data, by the way.
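
For the first item, the dtype change was applied to the one-hot arrays along these lines (a simplified sketch; np.zeros stands in for the arrays built in the linked script). One caveat I'm aware of: unless the backend floatx is switched before the model is built, Keras/TensorFlow casts the fed arrays back to float32 inside the graph, so the intermediate tensors in the traceback stay full size.

# Sketch of the float32 -> float16 attempt; the stand-in shape is the
# one reported in the error message.
import numpy as np
from keras import backend as K

decoder_target_data = np.zeros((3, 1143, 44592), dtype=np.float32)  # stand-in
decoder_target_data = decoder_target_data.astype(np.float16)        # halves the host-side size

# Only affects the graph if called before the layers are created;
# otherwise the model (and the tensor in the error) stays float32.
K.set_floatx('float16')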

I suspect it may be related to the size of a single batch (it's a NumPy array of shape [3, 1143, 44592]), or that it's simply a bug in my code; but honestly, I'm thoroughly stuck right now.
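
To put a number on that: the failing tensor is only one of the float32 intermediates of that shape created in the forward and backward pass of the softmax over the 44,592-word vocabulary, and each one is already over half a gigabyte:

# Size of ONE float32 tensor with the shape from the error message.
batch, timesteps, vocab = 3, 1143, 44592
print(batch * timesteps * vocab * 4 / 2**20)   # ~583 MiB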

Any help would be greatly appreciated!


