<!--- Licensed to the Apache Software Foundation (ASF) under one -->
<!--- or more contributor license agreements.  See the NOTICE file -->
<!--- distributed with this work for additional information -->
<!--- regarding copyright ownership.  The ASF licenses this file -->
<!--- to you under the Apache License, Version 2.0 (the -->
<!--- "License"); you may not use this file except in compliance -->
<!--- with the License.  You may obtain a copy of the License at -->

<!---   http://www.apache.org/licenses/LICENSE-2.0 -->

<!--- Unless required by applicable law or agreed to in writing, -->
<!--- software distributed under the License is distributed on an -->
<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
<!--- KIND, either express or implied.  See the License for the -->
<!--- specific language governing permissions and limitations -->
<!--- under the License. -->

# Text API

## Overview

The `mxnet.contrib.text` APIs are classes and functions for text data processing, such as
building indices for text tokens, loading pre-trained embedding vectors for those tokens, and
storing the vectors in the `mxnet.ndarray.NDArray` format.

```eval_rst
.. warning:: This package contains experimental APIs and may change in the near future.
```

This document lists the text APIs in MXNet:

```eval_rst
.. autosummary::
    :nosignatures:

    mxnet.contrib.text.embedding
    mxnet.contrib.text.vocab
    mxnet.contrib.text.utils
```

All the code demonstrated in this document assumes that the following modules or packages are
imported.

```python
>>> from mxnet import gluon
>>> from mxnet import nd
>>> from mxnet.contrib import text
>>> import collections

```

### Looking up pre-trained word embeddings for indexed words

As a common use case, let us look up pre-trained word embedding vectors for indexed words in just a
few lines of code. 

To begin with, suppose that we have a simple text data set in the string format. We can count
word frequency in the data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)

```

The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
Suppose that we want to build indices for all the keys in `counter` and load the defined fastText
word embedding for all such indexed words. First, we need a Vocabulary object with `counter` as its
argument.

```python
>>> my_vocab = text.vocab.Vocabulary(counter)

```

We can create a fastText word embedding object by specifying the embedding name `fasttext` and
the pre-trained file `wiki.simple.vec`. We also specify that the indexed tokens for loading the
fastText word embedding come from the defined Vocabulary object `my_vocab`.

```python
>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
...     vocabulary=my_vocab)

```

Now we are ready to look up the fastText word embedding vectors for indexed words, such as 'hello'
and 'world'.

```python
>>> my_embedding.get_vecs_by_tokens(['hello', 'world'])

[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
    ...
   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
 [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
    ...
   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

```

### Using pre-trained word embeddings in `gluon`

To demonstrate how to use pre-trained word embeddings in the `gluon` package, let us first obtain
indices of the words 'hello' and 'world'.

```python
>>> my_embedding.to_indices(['hello', 'world'])
[2, 1]

```

We can obtain the vector representation for the words 'hello' and 'world' by passing their
indices (2 and 1) to an `mxnet.gluon.nn.Embedding` layer whose weight is set to
`my_embedding.idx_to_vec`.
 
```python
>>> layer = gluon.nn.Embedding(len(my_embedding), my_embedding.vec_len)
>>> layer.initialize()
>>> layer.weight.set_data(my_embedding.idx_to_vec)
>>> layer(nd.array([2, 1]))

[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
    ...
   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
 [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
    ...
   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

```
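
Conceptually, once its weight is set to `idx_to_vec`, the embedding layer performs a row lookup.
The following pure-Python sketch illustrates this with a toy table. The table values and the helper
name `embed` are made up for illustration and are not part of MXNet.

```python
# Toy stand-in for my_embedding.idx_to_vec: one row per index (values are made up).
toy_idx_to_vec = [
    [0.0, 0.0],   # index 0: unknown token
    [0.1, -0.1],  # index 1: 'world'
    [0.4, 0.2],   # index 2: 'hello'
]

def embed(indices, table):
    """Return the rows of `table` selected by `indices`, in order."""
    return [table[i] for i in indices]

looked_up = embed([2, 1], toy_idx_to_vec)
print(looked_up)  # [[0.4, 0.2], [0.1, -0.1]]
```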

## Vocabulary

The vocabulary builds indices for text tokens. Such indexed tokens can be used by token embedding
instances. The input counter, whose keys are the candidate tokens to index, may be obtained via
[`count_tokens_from_str`](#mxnet.contrib.text.utils.count_tokens_from_str).


```eval_rst
.. currentmodule:: mxnet.contrib.text.vocab
.. autosummary::
    :nosignatures:

    Vocabulary
```

Suppose that we have a simple text data set in the string format. We can count word frequency in the
data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)

```

The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
Suppose that we want to build indices for the 2 most frequent keys in `counter` with the unknown
token representation '&lt;unk&gt;' and a reserved token '&lt;pad&gt;'.

```python
>>> my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2, unknown_token='<unk>',
...     reserved_tokens=['<pad>'])

```

We can access properties such as `token_to_idx` (mapping tokens to indices), `idx_to_token` (mapping
indices to tokens), `unknown_token` (representation of any unknown token), and `reserved_tokens`.


```python
>>> my_vocab.token_to_idx
{'<unk>': 0, '<pad>': 1, 'world': 2, 'hello': 3}
>>> my_vocab.idx_to_token
['<unk>', '<pad>', 'world', 'hello']
>>> my_vocab.unknown_token
'<unk>'
>>> my_vocab.reserved_tokens
['<pad>']
>>> len(my_vocab)
4
```

In addition to the specified unknown token '&lt;unk&gt;' and reserved token '&lt;pad&gt;', the 2 most
frequent words 'world' and 'hello' are also indexed.
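
The index layout above can be reproduced with a minimal pure-Python sketch. The helper `build_index`
is hypothetical and is not the implementation of `Vocabulary`; it assumes that the unknown token
comes first, reserved tokens follow, and the remaining tokens are ordered by descending frequency
(with ties broken alphabetically, which is an assumption of this sketch).

```python
import collections

def build_index(counter, most_freq_count=None, unknown_token='<unk>', reserved_tokens=None):
    """Hypothetical helper: index the unknown token, then reserved tokens, then frequent tokens."""
    idx_to_token = [unknown_token] + list(reserved_tokens or [])
    special = set(idx_to_token)
    # Most frequent first; ties broken alphabetically (an assumption of this sketch).
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    candidates = [tok for tok, _ in ranked if tok not in special]
    if most_freq_count is not None:
        candidates = candidates[:most_freq_count]
    idx_to_token.extend(candidates)
    token_to_idx = {tok: idx for idx, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

toy_counter = collections.Counter({'world': 3, 'hello': 2, 'nice': 1, 'hi': 1})
idx_to_token, token_to_idx = build_index(toy_counter, most_freq_count=2,
                                         reserved_tokens=['<pad>'])
print(idx_to_token)  # ['<unk>', '<pad>', 'world', 'hello']
```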




## Text token embedding

To load token embeddings from an externally hosted pre-trained token embedding file, such as those
of GloVe and FastText, use
[`embedding.create(embedding_name, pretrained_file_name)`](#mxnet.contrib.text.embedding.create).

To get all the available `embedding_name` and `pretrained_file_name`, use
[`embedding.get_pretrained_file_names()`](#mxnet.contrib.text.embedding.get_pretrained_file_names).

```python
>>> text.embedding.get_pretrained_file_names()
{'glove': ['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', ...],
'fasttext': ['wiki.en.vec', 'wiki.simple.vec', 'wiki.zh.vec', ...]}

```

Alternatively, to load embedding vectors from a custom pre-trained text token
embedding file, use [`CustomEmbedding`](#mxnet.contrib.text.embedding.CustomEmbedding).
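
As a rough illustration of what loading a custom file involves, the sketch below parses a toy
in-memory file. It assumes a layout of one token per line followed by its vector elements,
separated by a single space. The helper `parse_embedding_lines` is hypothetical and is not part of
MXNet.

```python
import io

def parse_embedding_lines(lines, elem_delim=' '):
    """Hypothetical parser: map each token to its list of float elements."""
    token_to_vec = {}
    for line in lines:
        parts = line.rstrip().split(elem_delim)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token, elems = parts[0], parts[1:]
        token_to_vec[token] = [float(e) for e in elems]
    return token_to_vec

# Toy file content (values are made up).
toy_file = io.StringIO("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
parsed = parse_embedding_lines(toy_file)
print(parsed['world'])  # [0.4, 0.5, 0.6]
```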

Moreover, to build composite embedding vectors, for example by concatenating the vectors from
several embeddings, use [`CompositeEmbedding`](#mxnet.contrib.text.embedding.CompositeEmbedding).
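
The idea of concatenation can be sketched in plain Python. The helper `concat_vectors` and the toy
tables are hypothetical; filling in zeros for a token that one table lacks is an assumption of this
sketch, not a statement about how `CompositeEmbedding` behaves.

```python
def concat_vectors(token, tables, vec_lens):
    """Hypothetical helper: concatenate the vectors of `token` taken from each table.

    A table that lacks the token contributes a zero vector of its own length.
    """
    combined = []
    for table, vec_len in zip(tables, vec_lens):
        combined.extend(table.get(token, [0.0] * vec_len))
    return combined

# Two toy embedding tables with different vector lengths (values are made up).
toy_a = {'hello': [0.1, 0.2]}
toy_b = {'hello': [0.3, 0.4, 0.5], 'world': [0.6, 0.7, 0.8]}

hello_vec = concat_vectors('hello', [toy_a, toy_b], [2, 3])
world_vec = concat_vectors('world', [toy_a, toy_b], [2, 3])
print(hello_vec)  # [0.1, 0.2, 0.3, 0.4, 0.5]
print(world_vec)  # [0.0, 0.0, 0.6, 0.7, 0.8]
```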

The indexed tokens in a text token embedding may come from a vocabulary or from the loaded embedding
vectors. In the former case, only the indexed tokens in the vocabulary are associated with the
loaded embedding vectors, such as those loaded from a pre-trained token embedding file. In the
latter case, all the tokens from the loaded embedding vectors are taken as the indexed tokens of
the embedding.
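
The difference between the two cases can be sketched with a toy dictionary standing in for a
pre-trained file. All names below are hypothetical, and the sketch assumes that the unknown token
sits at index 0 and that tokens without a pre-trained vector get zeros.

```python
# Toy stand-in for a pre-trained file: token -> vector (values are made up).
toy_pretrained = {'hello': [0.4, 0.2], 'world': [0.1, -0.1], 'nice': [0.5, 0.4]}
VEC_LEN = 2
UNKNOWN = '<unk>'

def index_from_vocabulary(vocab_tokens, pretrained):
    """Former case: index only the vocabulary's tokens (unknown token at index 0)."""
    idx_to_token = [UNKNOWN] + [t for t in vocab_tokens if t != UNKNOWN]
    idx_to_vec = [pretrained.get(t, [0.0] * VEC_LEN) for t in idx_to_token]
    return idx_to_token, idx_to_vec

def index_from_pretrained(pretrained):
    """Latter case: index every token found in the pre-trained vectors."""
    return index_from_vocabulary(list(pretrained), pretrained)

vocab_case = index_from_vocabulary(['world', 'hello'], toy_pretrained)
file_case = index_from_pretrained(toy_pretrained)
print(vocab_case[0])  # ['<unk>', 'world', 'hello']
print(file_case[0])   # ['<unk>', 'hello', 'world', 'nice']
```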


```eval_rst
.. currentmodule:: mxnet.contrib.text.embedding
.. autosummary::
    :nosignatures:

    register
    create
    get_pretrained_file_names
    GloVe
    FastText
    CustomEmbedding
    CompositeEmbedding
```


### Indexed tokens are from a vocabulary

One can specify that only the indexed tokens in a vocabulary are associated with the loaded
embedding vectors, such as those loaded from a pre-trained token embedding file.

To begin with, suppose that we have a simple text data set in the string format. We can count word
frequency in the data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)

```

The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
Suppose that we want to build indices for the 2 most frequent keys in `counter` and load the defined
fastText word embedding with pre-trained file `wiki.simple.vec` for these 2 words.

```python
>>> my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2)
>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
...     vocabulary=my_vocab)

```

Now we are ready to look up the fastText word embedding vectors for indexed words.

```python
>>> my_embedding.get_vecs_by_tokens(['hello', 'world'])

[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
    ...
   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
 [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
    ...
   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

```

We can also access properties such as `token_to_idx` (mapping tokens to indices), `idx_to_token`
(mapping indices to tokens), and `vec_len` (length of each embedding vector).

```python
>>> my_embedding.token_to_idx
{'<unk>': 0, 'world': 1, 'hello': 2}
>>> my_embedding.idx_to_token
['<unk>', 'world', 'hello']
>>> len(my_embedding)
3
>>> my_embedding.vec_len
300

```

If a token is unknown to `my_vocab`, its embedding vector is initialized according to the default
specification in `my_embedding` (all elements are 0).

```python
>>> my_embedding.get_vecs_by_tokens('nice')

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  ...
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
<NDArray 300 @cpu(0)>

```


### Indexed tokens are from the loaded embedding vectors

One can also use all the tokens from the loaded embedding vectors, such as those loaded from a
pre-trained token embedding file, as the indexed tokens of the embedding.

To begin with, we can create a fastText word embedding object by specifying the embedding name
'fasttext' and the pre-trained file 'wiki.simple.vec'. The argument `init_unknown_vec` specifies
the default vector representation for any unknown token. To index all the tokens from this
pre-trained word embedding file, we do not need to specify any vocabulary.

```python
>>> my_embedding = text.embedding.create('fasttext', pretrained_file_name='wiki.simple.vec',
...     init_unknown_vec=nd.zeros)

```

We can access properties such as `token_to_idx` (mapping tokens to indices), `idx_to_token` (mapping
indices to tokens), `vec_len` (length of each embedding vector), and `unknown_token` (representation
of any unknown token, default value is '&lt;unk&gt;').

```python
>>> my_embedding.token_to_idx['nice']
2586
>>> my_embedding.idx_to_token[2586]
'nice'
>>> my_embedding.vec_len
300
>>> my_embedding.unknown_token
'<unk>'

```

Every unknown token is mapped to index 0. If the unknown token representation '&lt;unk&gt;' is
encountered in the pre-trained token embedding file, index 0 of property `idx_to_vec` maps to the
pre-trained token embedding vector loaded from the file; otherwise, index 0 of property
`idx_to_vec` maps to the default token embedding vector specified via `init_unknown_vec` (set to
`nd.zeros` here). Since the pre-trained file does not have a vector for the token '&lt;unk&gt;', index 0
maps to an additional token '&lt;unk&gt;', and the number of tokens in the embedding is 111,052.


```python
>>> len(my_embedding)
111052
>>> my_embedding.idx_to_vec[0]

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  ...
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
<NDArray 300 @cpu(0)>
>>> my_embedding.get_vecs_by_tokens('nice')

[ 0.49397001  0.39996001  0.24000999 -0.15121    -0.087512    0.37114
  ...
  0.089521    0.29175001 -0.40917999 -0.089206   -0.1816     -0.36616999]
<NDArray 300 @cpu(0)>
>>> my_embedding.get_vecs_by_tokens(['unknownT0kEN', 'unknownT0kEN'])

[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   ...
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   ...
   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
<NDArray 2x300 @cpu(0)>

```
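
The lookup behavior shown above (a single token yields one vector, a list of tokens yields one
vector per token, and unknown tokens fall back to index 0) can be sketched in plain Python. The
helper `get_vecs` and the toy tables are hypothetical and only mimic the observable behavior.

```python
# Toy tables (values are made up); index 0 is reserved for the unknown token.
toy_token_to_idx = {'<unk>': 0, 'nice': 1}
toy_vectors = [[0.0, 0.0], [0.5, 0.4]]

def get_vecs(tokens, token_to_idx, idx_to_vec):
    """Hypothetical lookup: unknown tokens map to the vector stored at index 0."""
    single = isinstance(tokens, str)
    if single:
        tokens = [tokens]
    vecs = [idx_to_vec[token_to_idx.get(tok, 0)] for tok in tokens]
    # Mirror the shapes above: one vector for a string, a list of vectors for a list.
    return vecs[0] if single else vecs

known = get_vecs('nice', toy_token_to_idx, toy_vectors)
unknown = get_vecs(['unknownT0kEN', 'unknownT0kEN'], toy_token_to_idx, toy_vectors)
print(known)    # [0.5, 0.4]
print(unknown)  # [[0.0, 0.0], [0.0, 0.0]]
```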


### Implement a new text token embedding

To implement a new text token embedding, create a subclass of
`mxnet.contrib.text.embedding._TokenEmbedding`.
Also add ``@mxnet.contrib.text.embedding.register`` before this class. See
[`embedding.py`](https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/text/embedding.py)
for examples.


## Text utilities

The following functions provide utilities for text data processing.

```eval_rst
.. currentmodule:: mxnet.contrib.text.utils
.. autosummary::
    :nosignatures:

    count_tokens_from_str
```
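
As a rough illustration of the counting step, the following pure-Python sketch produces the same
kind of counter for the example string used throughout this document. The helper `count_tokens` is
hypothetical; it assumes that tokens are separated by spaces and sequences by newlines, and it is
not the implementation of `count_tokens_from_str`.

```python
import collections
import re

def count_tokens(source_str, token_delim=' ', seq_delim='\n'):
    """Hypothetical helper: split on both delimiters and count the non-empty tokens."""
    pattern = '|'.join(re.escape(d) for d in (token_delim, seq_delim))
    tokens = [tok for tok in re.split(pattern, source_str) if tok]
    return collections.Counter(tokens)

toy_counts = count_tokens(" hello world \n hello nice world \n hi world \n")
print(toy_counts['world'], toy_counts['hello'], toy_counts['nice'], toy_counts['hi'])  # 3 2 1 1
```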


## API Reference

<script type="text/javascript" src='../../../_static/js/auto_module_index.js'></script>

```eval_rst

.. automodule:: mxnet.contrib.text.embedding
    :members: register, create, get_pretrained_file_names
.. autoclass:: mxnet.contrib.text.embedding.GloVe
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens
.. autoclass:: mxnet.contrib.text.embedding.FastText
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens
.. autoclass:: mxnet.contrib.text.embedding.CustomEmbedding
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens
.. autoclass:: mxnet.contrib.text.embedding.CompositeEmbedding
    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens

.. automodule:: mxnet.contrib.text.vocab
.. autoclass:: mxnet.contrib.text.vocab.Vocabulary
    :members: to_indices, to_tokens

.. automodule:: mxnet.contrib.text.utils
    :members: count_tokens_from_str

```
<script>auto_index("api-reference");</script>