
Commit 1d25dd9

New arguments to some methods of dlm.ldl.LDL & Updates the documentation. (#9)
* Fixes tests
* Adds the 'count' argument to dlm.ldl.LDL.gen_cmat
* Adds the dependency on netcdf4 in pyproject.toml
* Adds a new argument 'mats' to dlm.ldl.LDL.save_matrices for saving matrices selectively
* Fixes the generation of C-hat and S-hat in docs (quickstart.rst)
* Fixes pyldl to discriminative_lexicon_model in docs
* Adds a docstring to dlm.performance.accuracy
* Adds dlm.ldl.LDL.accuracy
* Updates .gitignore to exclude notes/
* Adds a new arg 'suffix' to dlm.mapping.load_mat_from_csv
* Updates dlm.mapping.load_csv to handle a gz file
* Fixes a typo in dlm.mapping.load_csv
* Adds documentation for incremental learning
1 parent 7e0c015 commit 1d25dd9

File tree

3 files changed: +180 -6 lines changed


discriminative_lexicon_model/mapping.py

Lines changed: 11 additions & 6 deletions
@@ -5,6 +5,7 @@
 import numpy as np
 import xarray as xr
 import fasttext as ft
+import gzip
 from tqdm import tqdm

 from . import mapping as lmap
@@ -535,7 +536,7 @@ def save_mat_as_csv (mat, directory='.', stem='mat', add=''):
     np.savetxt(path_dim1, vals_dim1, fmt='%s', delimiter='\t', comments=None)
     return None

-def load_mat_from_csv (directory, stem, add=''):
+def load_mat_from_csv (directory, stem, add='', suffix='.csv'):
     """
     Loads the csv files that are assumed to have been saved by save_mat_as_csv.
@@ -555,13 +556,15 @@ def load_mat_from_csv (directory, stem, add=''):
         the files 'foobar_main_X.csv', 'foobar_xxx_X.csv' where 'xxx' is the
         name of the first dimension, 'foobar_yyy_X.csv' where 'yyy' is the name
         of the second dimension, and 'foobar_meta_X.csv'.
+    suffix : str
+        The file extension. By default, it is assumed to be '.csv'. You can set
+        it to '.csv.gz' if the output of save_mat_as_csv is compressed by gzip.

     Returns
     -------
     mat : xarray.core.dataarray.DataArray
         An xarray matrix, reconstructed from the csv files being loaded.
     """
-    suffix = '.csv'
     name_main = 'main'
     name_meta = 'meta'
     path_main = '{}/{}_{}{}{}'.format(directory, stem, name_main, add, suffix)
@@ -595,10 +598,12 @@ def load_csv (path):
     csv : list
         A list of dimension values.
     """
-    with open(path, 'r') as f:
-        csv = f.readlines()
+    if path[-3:]=='.gz':
+        with gzip.open(path, 'rt') as f:
+            csv = f.readlines()
+    else:
+        with open(path, 'r') as f:
+            csv = f.readlines()
     csv = [ i.rstrip('\n') for i in csv ]
     return csv
-
-

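With the new suffix argument, matrices written by save_mat_as_csv and then gzip-compressed can be read back directly. A minimal usage sketch; the stem 'cmat' and the gzipped file names are illustrative assumptions, not part of the commit:

    import discriminative_lexicon_model.mapping as lmap

    # Assumes save_mat_as_csv(mat, directory='.', stem='cmat') was run and the
    # resulting CSV files were then gzip-compressed (e.g. cmat_main.csv.gz).
    cmat = lmap.load_mat_from_csv('.', 'cmat', suffix='.csv.gz')
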
docs/source/incremental.rst

Lines changed: 168 additions & 0 deletions
====================
Incremental learning
====================


---------------------------------------
Incremental learning by a list of words
---------------------------------------
Weight matrices in LDL (i.e., :math:`\mathbf{F}` and :math:`\mathbf{G}`) can also be estimated step by step, which is called *incremental* learning. As a simple example, suppose the lexicon contains only the two words "a" and "an", and we encounter them in the order "a", "a", "an", "a". Incremental learning is performed by discriminative_lexicon_model.mapping.incremental_learning, whose first argument is the series of learning events.

.. code-block:: python

   >>> import xarray as xr
   >>> import discriminative_lexicon_model.mapping as pm
   >>> cmat = pm.gen_cmat(['a', 'an'], gram=2)
   >>> print(cmat)
   <xarray.DataArray (word: 2, cues: 4)> Size: 64B
   array([[1, 1, 0, 0],
          [1, 0, 1, 1]])
   Coordinates:
     * word       (word) <U2 16B 'a' 'an'
     * cues       (cues) <U2 32B '#a' 'a#' 'an' 'n#'

   >>> smat = xr.DataArray([[0.9, -0.2, 0.1], [0.1, 0.9, -0.2]], dims=('word','semantics'), coords={'word':['a','an'], 'semantics':['S1','S2','S3']})
   >>> print(smat)
   <xarray.DataArray (word: 2, semantics: 3)> Size: 48B
   array([[ 0.9, -0.2,  0.1],
          [ 0.1,  0.9, -0.2]])
   Coordinates:
     * word       (word) <U2 16B 'a' 'an'
     * semantics  (semantics) <U2 24B 'S1' 'S2' 'S3'

   >>> fmat = pm.incremental_learning(['a', 'a', 'an', 'a'], cmat, smat)
   >>> print(fmat)
   <xarray.DataArray (cues: 4, semantics: 3)> Size: 96B
   array([[ 0.21402,  0.03544,  0.00478],
          [ 0.22022, -0.05816,  0.02658],
          [-0.0062 ,  0.0936 , -0.0218 ],
          [-0.0062 ,  0.0936 , -0.0218 ]])
   Coordinates:
     * cues       (cues) <U2 32B '#a' 'a#' 'an' 'n#'
     * semantics  (semantics) <U2 24B 'S1' 'S2' 'S3'

Note that the :math:`\mathbf{S}` matrix is set up so that the first dimension "S1" is strongly correlated with "a" while "S2" is correlated with "an". In other words, you can conceptually interpret "S1" as the core meaning of "a" and "S2" as that of "an". In the weight matrix (i.e., :math:`\mathbf{F}`), the first two rows, namely the cues "#a" and "a#", are strongly associated with the first column, namely "S1". The last two rows, namely the cues "an" and "n#", are strongly associated with the second column, namely "S2". The associations of "an" and "n#" with "S2" are numerically smaller than those of "#a" and "a#" with "S1", because "an" occurs only once while "a" occurs three times in the learning events.

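Under the hood, incremental_learning presumably applies the Widrow-Hoff (delta-rule) update event by event. The following minimal sketch of that rule reproduces the numbers above under the assumption of a learning rate of 0.1; the name "eta" and its default value are illustrative assumptions, not necessarily the function's actual argument:

.. code-block:: python

   >>> import numpy as np
   >>> def incremental_learning_sketch(words, cmat, smat, eta=0.1):
   ...     """Delta-rule sketch; 'eta' is an assumed learning rate."""
   ...     fmat = np.zeros((cmat.shape[1], smat.shape[1]))
   ...     for w in words:
   ...         c = cmat.sel(word=w).values   # form (cue) vector of the event
   ...         s = smat.sel(word=w).values   # semantic (outcome) vector
   ...         fmat += eta * np.outer(c, s - c @ fmat)   # error-driven update
   ...     return fmat
   >>> incremental_learning_sketch(['a', 'a', 'an', 'a'], cmat, smat).round(5)
   array([[ 0.21402,  0.03544,  0.00478],
          [ 0.22022, -0.05816,  0.02658],
          [-0.0062 ,  0.0936 , -0.0218 ],
          [-0.0062 ,  0.0936 , -0.0218 ]])
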
As shown below, after a sufficient number of learning events, the incremental estimates approximate those of *endstate* learning.

.. code-block:: python

   >>> import pandas as pd
   >>> words = pd.Series(['a', 'an']).sample(1000, replace=True, random_state=518).tolist()
   >>> fmat_inc = pm.incremental_learning(words, cmat, smat)
   >>> fmat_end = pm.gen_fmat(cmat=cmat, smat=smat)
   >>> print(fmat_inc)
   <xarray.DataArray (cues: 4, semantics: 3)> Size: 96B
   array([[ 3.80000000e-01,  1.00000000e-01, -5.65948715e-19],
          [ 5.20000000e-01, -3.00000000e-01,  1.00000000e-01],
          [-1.40000000e-01,  4.00000000e-01, -1.00000000e-01],
          [-1.40000000e-01,  4.00000000e-01, -1.00000000e-01]])
   Coordinates:
     * cues       (cues) <U2 32B '#a' 'a#' 'an' 'n#'
     * semantics  (semantics) <U2 24B 'S1' 'S2' 'S3'

   >>> print(fmat_end)
   <xarray.DataArray (cues: 4, semantics: 3)> Size: 96B
   array([[ 3.80000000e-01,  1.00000000e-01, -2.77555756e-17],
          [ 5.20000000e-01, -3.00000000e-01,  1.00000000e-01],
          [-1.40000000e-01,  4.00000000e-01, -1.00000000e-01],
          [-1.40000000e-01,  4.00000000e-01, -1.00000000e-01]])
   Coordinates:
     * cues       (cues) <U2 32B '#a' 'a#' 'an' 'n#'
     * semantics  (semantics) <U2 24B 'S1' 'S2' 'S3'

   >>> print(fmat_inc.round(10).identical(fmat_end.round(10)))
   True

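Mathematically, the endstate mapping is the least-squares solution :math:`\mathbf{F} = \mathbf{C}^{+}\mathbf{S}`, with :math:`\mathbf{C}^{+}` the Moore-Penrose pseudoinverse of :math:`\mathbf{C}`. Whether gen_fmat computes it in exactly this way is an assumption here, but the relation can be checked numerically:

.. code-block:: python

   >>> # Assumption: endstate = least-squares mapping pinv(C) @ S.
   >>> import numpy as np
   >>> np.allclose(np.linalg.pinv(cmat.values) @ smat.values, fmat_end.values)
   True
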
Note that the order of learning events matters in incremental learning. Compare the following two examples.

.. code-block:: python

   >>> import numpy as np
   >>> words_a_first = np.repeat(['a', 'an'], [10, 10])
   >>> words_an_first = np.repeat(['an', 'a'], [10, 10])
   >>> fmat_a_first = pm.incremental_learning(words_a_first, cmat, smat)
   >>> fmat_an_first = pm.incremental_learning(words_an_first, cmat, smat)
   >>> print(fmat_a_first)
   <xarray.DataArray (cues: 4, semantics: 3)> Size: 96B
   array([[ 0.30396166,  0.23117687, -0.03460906],
          [ 0.40168162, -0.08926258,  0.04463129],
          [-0.09771995,  0.32043945, -0.07924035],
          [-0.09771995,  0.32043945, -0.07924035]])
   Coordinates:
     * cues       (cues) <U2 32B '#a' 'a#' 'an' 'n#'
     * semantics  (semantics) <U2 24B 'S1' 'S2' 'S3'

   >>> print(fmat_an_first)
   <xarray.DataArray (cues: 4, semantics: 3)> Size: 96B
   array([[ 0.41961651,  0.07215146,  0.0087615 ],
          [ 0.38722476, -0.21937428,  0.073545  ],
          [ 0.03239175,  0.29152574, -0.0647835 ],
          [ 0.03239175,  0.29152574, -0.0647835 ]])
   Coordinates:
     * cues       (cues) <U2 32B '#a' 'a#' 'an' 'n#'
     * semantics  (semantics) <U2 24B 'S1' 'S2' 'S3'

In the first case, where "a" is encountered 10 times before "an" is encountered 10 times consecutively, the estimated associations are "biased" towards "an". This can be seen, for example, in the cell in the first row and the second column, namely the association strength between "#a" and "S2". Note that the equilibrium of this association is 0.10 (see the example above for "fmat_end"). Since all the encounters with "an" happen more recently, these later learning events have a bigger effect.

In contrast, in the latter case, where "an" is encountered 10 times before "a" is encountered 10 times, the association from "#a" to "S1" is much bigger than that from "#a" to "S2". Note that the equilibrium of the association from "#a" to "S1" is 0.38 (from "fmat_end" in the example above). Since "a" is encountered many times towards the end of learning, the weights are biased towards "a".

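The size of this recency effect can be made precise under the delta-rule assumption sketched above. If a single word whose cue vector activates :math:`k` cues is presented :math:`n` times in a row, starting from zero weights, its cue weights approach their equilibrium :math:`\mathbf{w}_\infty` geometrically:

.. math::

   \mathbf{w}_n = \left(1 - (1 - \eta k)^n\right)\,\mathbf{w}_\infty

With the assumed learning rate :math:`\eta = 0.1` and :math:`k = 2` active cues per word here, each presentation closes 20% of the remaining gap, so ten consecutive presentations already cover a proportion :math:`1 - 0.8^{10} \approx 0.89` of the distance to equilibrium.
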
----------------------------------------------
Incremental learning by a list of word indices
----------------------------------------------
Learning events (i.e., which words are encountered) can also be specified by word indices. This is useful when the :math:`\mathbf{C}` and/or :math:`\mathbf{S}` matrices contain duplicated word labels; duplicated rows can arise when word tokens are involved. Consider the following example:

.. code-block:: python

   >>> import xarray as xr
   >>> import discriminative_lexicon_model.mapping as pm
   >>> cmat = pm.gen_cmat(['a', 'an', 'an'], gram=2)
   >>> smat = xr.DataArray([[0.9, -0.2, 0.1], [0.1, 0.9, -0.2], [0.2, 0.8, -0.1]], dims=('word','semantics'), coords={'word':['a','an','an'], 'semantics':['S1','S2','S3']})
   >>> print(cmat)
   <xarray.DataArray (word: 3, cues: 4)> Size: 96B
   array([[1, 1, 0, 0],
          [1, 0, 1, 1],
          [1, 0, 1, 1]])
   Coordinates:
     * word       (word) <U2 24B 'a' 'an' 'an'
     * cues       (cues) <U2 32B '#a' 'a#' 'an' 'n#'

   >>> print(smat)
   <xarray.DataArray (word: 3, semantics: 3)> Size: 72B
   array([[ 0.9, -0.2,  0.1],
          [ 0.1,  0.9, -0.2],
          [ 0.2,  0.8, -0.1]])
   Coordinates:
     * word       (word) <U2 24B 'a' 'an' 'an'
     * semantics  (semantics) <U2 24B 'S1' 'S2' 'S3'

Note that the word type "an" has two rows. Its form vectors are identical (the second and third rows of the :math:`\mathbf{C}` matrix), while its semantic vectors differ slightly (the second and third rows of the :math:`\mathbf{S}` matrix). You can view the different semantic vectors as different meanings of the same word in different contexts. In a case like this, specifying learning events by a list of words, as below, raises an "InvalidIndexError", because the function cannot determine which semantic vector to use for "an".

.. code-block:: python

   >>> fmat = pm.incremental_learning(['a', 'a', 'an', 'a'], cmat, smat)
   >>> # This raises an error.

Instead, you need to specify learning events in terms of word indices. For this purpose, discriminative_lexicon_model.mapping.incremental_learning_byind can be used:

.. code-block:: python

   >>> events = [0, 0, 1, 2, 2]   # 'a', 'a', 'an' (2nd row), 'an' (3rd row), 'an' (3rd row)
   >>> fmat = pm.incremental_learning_byind(events, cmat, smat)
   >>> print(fmat)
   <xarray.DataArray (cues: 4, semantics: 3)> Size: 96B
   array([[ 0.165422,  0.151984, -0.012742],
          [ 0.162   , -0.036   ,  0.018   ],
          [ 0.003422,  0.187984, -0.030742],
          [ 0.003422,  0.187984, -0.030742]])
   Coordinates:
     * cues       (cues) <U2 32B '#a' 'a#' 'an' 'n#'
     * semantics  (semantics) <U2 24B 'S1' 'S2' 'S3'

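These values can be reproduced with the delta-rule sketch from the first section by selecting rows by position instead of by label, which sidesteps the duplicated word labels; presumably this positional selection is all that the index-based variant does differently (an assumption, not a documented guarantee):

.. code-block:: python

   >>> # Minimal check, assuming the same delta rule with learning rate 0.1.
   >>> import numpy as np
   >>> C, S = cmat.values, smat.values
   >>> F = np.zeros((C.shape[1], S.shape[1]))
   >>> for i in [0, 0, 1, 2, 2]:
   ...     F += 0.1 * np.outer(C[i], S[i] - C[i] @ F)   # row i = one event
   >>> F.round(6)
   array([[ 0.165422,  0.151984, -0.012742],
          [ 0.162   , -0.036   ,  0.018   ],
          [ 0.003422,  0.187984, -0.030742],
          [ 0.003422,  0.187984, -0.030742]])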

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -10,3 +10,4 @@
    mapping
    measures
    performance
+   incremental
