<!--- Licensed to the Apache Software Foundation (ASF) under one -->
<!--- or more contributor license agreements.  See the NOTICE file -->
<!--- distributed with this work for additional information -->
<!--- regarding copyright ownership.  The ASF licenses this file -->
<!--- to you under the Apache License, Version 2.0 (the -->
<!--- "License"); you may not use this file except in compliance -->
<!--- with the License.  You may obtain a copy of the License at -->

<!---   http://www.apache.org/licenses/LICENSE-2.0 -->

<!--- Unless required by applicable law or agreed to in writing, -->
<!--- software distributed under the License is distributed on an -->
<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
<!--- KIND, either express or implied.  See the License for the -->
<!--- specific language governing permissions and limitations -->
<!--- under the License. -->

# Use data from S3 for training

AWS S3 is a cloud-based object storage service that allows storage and retrieval of large amounts of data at a very low cost. This makes it an attractive option to store large training datasets. MXNet is deeply integrated with S3 for this purpose.

An S3 protocol URL (like `s3://bucket-name/training-data`) can be provided as a parameter for any data iterator that takes a file path as input. For example,

```
data_iter = mx.io.ImageRecordIter(
    path_imgrec="s3://bucket-name/training-data/caltech_train.rec",
    data_shape=(3, 227, 227),
    batch_size=4,
    resize=256)
```
Following are detailed instructions on how to use data from S3 for training.

## Step 1: Build MXNet with S3 integration enabled

Follow instructions [here](http://mxnet.io/install/index.html) to install MXNet from source with the following additional steps to enable S3 integration.

1. Install `libcurl4-openssl-dev` and `libssl-dev` before building MXNet. These packages are required to read/write from AWS S3.
2. Append `USE_S3=1` to `config.mk` before building MXNet.
    ```
    echo "USE_S3=1" >> config.mk
    ```

## Step 2: Configure S3 authentication tokens

MXNet requires the S3 environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` to be set. [Here](https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/) are instructions to get the access keys from AWS console.

```
export AWS_ACCESS_KEY_ID=<your-access-key-id>
AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
```

## Step 3: Upload data to S3

There are several ways to upload data to S3. One easy way is to use the AWS command line utility. For example, the following `sync` command will recursively copy contents from a local directory to a directory in S3.

```
aws s3 sync ./training-data s3://bucket-name/training-data
```

## Step 4: Train with data from S3

Once the data is in S3, it is very straightforward to use it from MXNet. Any data iterator that can read/write data from a local drive can also read/write data from S3.

Let's modify an existing example code in MXNet repository to read data from S3 instead of local disk. [`mxnet/tests/python/train/test_conv.py`](https://github.com/dmlc/mxnet/blob/master/tests/python/train/test_conv.py) trains a convolutional network using MNIST data from local disk. We'll do the following change to read the data from S3 instead.

```
~/mxnet$ sed -i -- 's/data\//s3:\/\/bucket-name\/training-data\//g' ./tests/python/train/test_conv.py

~/mxnet$ git diff ./tests/python/train/test_conv.py
diff --git a/tests/python/train/test_conv.py b/tests/python/train/test_conv.py
index 039790e..66a60ce 100644
--- a/tests/python/train/test_conv.py
+++ b/tests/python/train/test_conv.py
@@ -39,14 +39,14 @@ def get_iters():

     batch_size = 100
     train_dataiter = mx.io.MNISTIter(
-            image="data/train-images-idx3-ubyte",
-            label="data/train-labels-idx1-ubyte",
+            image="s3://bucket-name/training-data/train-images-idx3-ubyte",
+            label="s3://bucket-name/training-data/train-labels-idx1-ubyte",
             data_shape=(1, 28, 28),
             label_name='sm_label',
             batch_size=batch_size, shuffle=True, flat=False, silent=False, seed=10)
     val_dataiter = mx.io.MNISTIter(
-            image="data/t10k-images-idx3-ubyte",
-            label="data/t10k-labels-idx1-ubyte",
+            image="s3://bucket-name/training-data/t10k-images-idx3-ubyte",
+            label="s3://bucket-name/training-data/t10k-labels-idx1-ubyte",
             data_shape=(1, 28, 28),
             label_name='sm_label',
             batch_size=batch_size, shuffle=True, flat=False, silent=False)
```

After the above change `test_conv.py` will fetch data from S3 instead of the local disk.

```
python ./tests/python/train/test_conv.py
[21:59:19] src/io/s3_filesys.cc:878: No AWS Region set, using default region us-east-1
[21:59:21] src/io/iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(100,1,28,28)
[21:59:21] src/io/iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(100,1,28,28)
INFO:root:Start training with [cpu(0)]
Start training with [cpu(0)]
INFO:root:Epoch[0] Resetting Data Iterator
Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=11.277
Epoch[0] Time cost=11.277
INFO:root:Epoch[0] Validation-accuracy=0.955100
Epoch[0] Validation-accuracy=0.955100
INFO:root:Finish fit...
Finish fit...
INFO:root:Finish predict...
Finish predict...
INFO:root:final accuracy = 0.955100
final accuracy = 0.955100
```
