<!--- Licensed to the Apache Software Foundation (ASF) under one -->
<!--- or more contributor license agreements.  See the NOTICE file -->
<!--- distributed with this work for additional information -->
<!--- regarding copyright ownership.  The ASF licenses this file -->
<!--- to you under the Apache License, Version 2.0 (the -->
<!--- "License"); you may not use this file except in compliance -->
<!--- with the License.  You may obtain a copy of the License at -->

<!---   http://www.apache.org/licenses/LICENSE-2.0 -->

<!--- Unless required by applicable law or agreed to in writing, -->
<!--- software distributed under the License is distributed on an -->
<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
<!--- KIND, either express or implied.  See the License for the -->
<!--- specific language governing permissions and limitations -->
<!--- under the License. -->

# Data Loading API

## Overview

This document summarizes supported data formats and iterator APIs to read the
data including:

```eval_rst
.. autosummary::
    :nosignatures:

    mxnet.io
    mxnet.recordio
    mxnet.image
```

First, let's see how to write an iterator for a new data format.
The following iterator can be used to train a symbol whose input data variable has
name `data` and input label variable has name `softmax_label`.
The iterator also provides information about the batch, including the
shapes and name.

```python
>>> nd_iter = mx.io.NDArrayIter(data={'data':mx.nd.ones((100,10))},
...                             label={'softmax_label':mx.nd.ones((100,))},
...                             batch_size=25)
>>> print(nd_iter.provide_data)
[DataDesc[data,(25, 10L),<type 'numpy.float32'>,NCHW]]
>>> print(nd_iter.provide_label)
[DataDesc[softmax_label,(25,),<type 'numpy.float32'>,NCHW]]
```

Let's see a complete example of how to use data iterator in model training.
```python
>>> data = mx.sym.Variable('data')
>>> label = mx.sym.Variable('softmax_label')
>>> fullc = mx.sym.FullyConnected(data=data, num_hidden=1)
>>> loss = mx.sym.SoftmaxOutput(data=fullc, label=label)
>>> mod = mx.mod.Module(loss, data_names=['data'], label_names=['softmax_label'])
>>> mod.bind(data_shapes=nd_iter.provide_data, label_shapes=nd_iter.provide_label)
>>> mod.fit(nd_iter, num_epoch=2)
```

A detailed tutorial is available at
[Iterators - Loading data](http://mxnet.io/tutorials/basic/data.html).

## Data iterators

```eval_rst
    .. currentmodule:: mxnet
```

```eval_rst
.. autosummary::
    :nosignatures:

    io.NDArrayIter
    io.CSVIter
    io.LibSVMIter
    io.ImageRecordIter
    io.ImageRecordInt8Iter
    io.ImageRecordUInt8Iter
    io.MNISTIter
    recordio.MXRecordIO
    recordio.MXIndexedRecordIO
    image.ImageIter
    image.ImageDetIter
```

## Helper classes and functions

### Data structures and other iterators

```eval_rst
.. autosummary::
    :nosignatures:

    io.DataDesc
    io.DataBatch
    io.DataIter
    io.ResizeIter
    io.PrefetchingIter
    io.MXDataIter
```

### Functions to read and write RecordIO files

```eval_rst
.. autosummary::
    :nosignatures:

    recordio.pack
    recordio.unpack
    recordio.unpack_img
    recordio.pack_img
```

## How to develop a new iterator

Writing a new data iterator in Python is straightforward. Most MXNet
training/inference programs accept an iterable object with ``provide_data``
and ``provide_label`` properties.
This [tutorial](http://mxnet.io/tutorials/basic/data.html) explains how to
write an iterator from scratch.

The following example demonstrates how to combine
multiple data iterators into a single one. It can be used for multiple
modality training such as image captioning, in which images are read by
``ImageRecordIter`` while documents are read by ``CSVIter``

```python
class MultiIter:
    def __init__(self, iter_list):
        self.iters = iter_list
    def next(self):
        batches = [i.next() for i in self.iters]
        return DataBatch(data=[*b.data for b in batches],
                         label=[*b.label for b in batches])
    def reset(self):
        for i in self.iters:
            i.reset()
    @property
    def provide_data(self):
        return [*i.provide_data for i in self.iters]
    @property
    def provide_label(self):
        return [*i.provide_label for i in self.iters]

iter = MultiIter([mx.io.ImageRecordIter('image.rec'), mx.io.CSVIter('txt.csv')])
```

Parsing and performing another pre-processing such as augmentation may be expensive.
If performance is critical, we can implement a data iterator in C++. Refer to
[src/io](https://github.com/dmlc/mxnet/tree/master/src/io) for examples.

### How to change the batch layout

By default, the backend engine treats the first dimension of each data and label variable in data
iterators as the batch size (i.e. `NCHW` or `NT` layout). In order to override the axis for batch size,
the `provide_data` (and `provide_label` if there is label) properties should include the layouts. This
is especially useful in RNN since `TNC` layouts are often more efficient. For example:

```python
@property
def provide_data(self):
    return [DataDesc(name='seq_var', shape=(seq_length, batch_size), layout='TN')]
```
The backend engine will recognize the index of `N` in the `layout` as the axis for batch size.

## API Reference

<script type="text/javascript" src='../../../_static/js/auto_module_index.js'></script>

### mxnet.io - Data Iterators

```eval_rst
.. automodule:: mxnet.io
    :noindex:
    :members: NDArrayIter, CSVIter, LibSVMIter, ImageRecordIter, ImageRecordUInt8Iter, MNISTIter
```

### mxnet.io - Helper Classes & Functions

```eval_rst
.. automodule:: mxnet.io
    :noindex:
    :members: DataBatch, DataDesc, DataIter, MXDataIter, PrefetchingIter, ResizeIter
```

### mxnet.recordio

```eval_rst
.. currentmodule:: mxnet.recordio

.. automodule:: mxnet.recordio
    :members:

```

```eval_rst
.. _name: mxnet.symbol.Symbol.name
.. _shape: mxnet.ndarray.NDArray.shape

```

<script>auto_index("api-reference");</script>
