CLML Read-Data

This document still has many gaps, but it will do for now.

Package

item	value
package	clml.hjs.read-data
nickname	---
file	./src/hjs/src/read-data.lisp
dependencies	---

Class

class	description
dataset	Base class
unspecialized-dataset	Class used when data is first read in
specialized-dataset	Data whose column types have been specified
numeric-dataset	Data whose column types are all specified as numeric
category-dataset	Data whose column types are all specified as category
numeric-matrix-dataset	Dataset represented as matrix (2-dim CL array)
numeric-and-category-dataset	Data with a mix of numeric and category column types
numeric-matrix-and-category-dataset	Dataset specialized in both numeric (as matrix) and category values.
dimension	...

Class diagram

+--------------------------+                        +-----------------+
| dataset                  |                        | dimension       |
|==========================|                        |=================|
| dimensions  simple-array |<-----------------------| name     string |
+--------------------------+                        | type     symbol |
             |                                      | index    fixnum |
             |                                      | metadata list   |
             |                                      +-----------------+
             |
             +-------------------------------+
             |                               |
             V                               V
   +---------------------+       +-----------------------+
   | specialized-dataset |       | unspecialized-dataset |
   |=====================|       |=======================|
   |---------------------|       | points   simple-array |
   +---------------------+       |-----------------------|
             |                   +-----------------------+
             |
             +---------------------------------------+-------------------------------+
             |                                       |                               |
             V                                       V                               V
   +------------------------------+  +-------------------------------+  +------------------------+
   | numeric-dataset              |  | category-dataset              |  | numeric-matrix-dataset |
   |==============================|  |===============================|  |========================|
   | numeric-points  simple-array |  | category-points  simple-array |  | numeric-points    dmat |
   |------------------------------|  |-------------------------------|  |------------------------|
   +------------------------------+  +-------------------------------+  +------------------------+
                        |                 |                |                    |
                        V                 V                V                    V
                 +------------------------------+  +-------------------------------------+
                 | numeric-and-category-dataset |  | numeric-matrix-and-category-dataset |
                 |==============================|  |=====================================|
                 |------------------------------|  |-------------------------------------|
                 +------------------------------+  +-------------------------------------+

Operator

name	description
read-data-from-file
pick-and-specialize-data
divide-dataset
choice-dimensions
choice-a-dimension
make-unspecialized-dataset
dataset-cleaning
make-bootstrap-sample-datasets

READ-DATA-FROM-FILE

read-data-from-file filename &key type external-format csv-type-spec csv-header-p missing-value-check missing-values-list
=> unspecialized-dataset

Arguments

attribute	description	type	default
filename		string
type	:sexp or :csv	keyword symbol
external-format		acl-external-format
csv-type-spec	Type conversion applied when reading a CSV file, e.g. `'(string integer double-float double-float)`	list symbol
csv-header-p	Whether the first row holds the column names	boolean	t
missing-value-check	Whether to detect missing values	boolean	t
missing-values-list	Values treated as missing	list	`'(nil "" "NA")`

Values

unspecialized-dataset

Description

If external-format is not specified, :default is used for :sexp and :932 for :csv (ACL expression for )
CSV input basically follows RFC4180.
- A newline is always interpreted as the end of a data row, so a field value cannot contain a newline.
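
The reading rules above can be illustrated with a short sketch. It is written in Python and does not call CLML; `read_csv_rows`, its parameters, the `CONVERTERS` table and the use of `None` as the missing-value marker are all hypothetical choices made for illustration. It shows the three ideas described here: every newline ends a row, values found in the missing-value list are flagged as missing, and the remaining fields are converted according to a per-column type list.

```python
import csv

# Hypothetical converters standing in for the Lisp type names in csv-type-spec.
CONVERTERS = {"string": str, "integer": int, "double-float": float}

def read_csv_rows(text, type_spec, header=True, missing_values=("", "NA")):
    """Parse CSV text line by line, so that a newline always ends a row."""
    lines = [line for line in text.split("\n") if line != ""]
    # Parsing each line on its own means a quoted field can never span lines.
    rows = [next(csv.reader([line])) for line in lines]
    names = rows.pop(0) if header else None
    data = []
    for row in rows:
        converted = []
        for value, type_name in zip(row, type_spec):
            if value in missing_values:
                converted.append(None)  # missing-value marker (an assumption)
            else:
                converted.append(CONVERTERS[type_name](value))
        data.append(converted)
    return names, data

names, data = read_csv_rows(
    'name,age,height\n"Sato, A",30,1.72\nB,NA,1.60\n',
    ["string", "integer", "double-float"],
)
```

Here the quoted comma in `"Sato, A"` stays inside one field, while the `NA` in the second data row becomes a missing value instead of being passed to the integer converter.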

References

PICK-AND-SPECIALIZE-DATA

pick-and-specialize-data dataset &key range except data-types
=> numeric-dataset, category-dataset or numeric-and-category-dataset

Arguments

attribute	description	type	default
dataset		unspecialized-dataset
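range	Columns to include in the result, counted from 0, e.g. `'(0 1 3 4)`	:all or list integer	:all
except	The inverse of :range: columns to exclude from the result, counted from 0, e.g. `'(2)`	list integer
data-types	List giving the type of each column, numeric or category, e.g. `'(:category :numeric :numeric)`	???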

Values

The class of the result (for example category-dataset or numeric-and-category-dataset) is chosen automatically according to data-types.

attribute	description	type
numeric-dataset
category-dataset
numeric-and-category-dataset
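
The column selection and the automatic choice of result class can be sketched as follows. This is a Python illustration, not CLML code: `select_columns` and `result_class` are hypothetical helpers, and the exact rule CLML applies (for example, how data-types lines up with the selected columns) is an assumption.

```python
def select_columns(n_columns, range_=":all", except_=()):
    """Return the 0-based column indices kept by range / except."""
    indices = list(range(n_columns)) if range_ == ":all" else list(range_)
    excluded = set(except_)
    return [i for i in indices if i not in excluded]

def result_class(data_types):
    """Name the dataset class implied by the per-column data types."""
    kinds = set(data_types)
    if kinds == {":numeric"}:
        return "numeric-dataset"
    if kinds == {":category"}:
        return "category-dataset"
    # Anything else is treated as a mix of both kinds (an assumption).
    return "numeric-and-category-dataset"

kept = select_columns(5, except_=[2])
cls = result_class([":category", ":numeric", ":numeric", ":numeric"])
```

With five columns and `except_=[2]`, the kept indices match the `'(0 1 3 4)` example given for range, and a type list that mixes both kinds yields the mixed class.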

Description

References

DIVIDE-DATASET

divide-dataset specialized-d &key divide-ratio random range except
=> unspecialized-dataset, numeric-dataset, category-dataset or numeric-and-category-dataset

Arguments

attribute	description	type	default
specialized-d		unspecialized-dataset or specialized-dataset
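divide-ratio	Ratio for splitting rows; if nil, rows are not split. E.g. `'(1 2 3)` splits the rows in the ratio 1:2:3.	list non-negative-integer
random	If t, rows are split at random	boolean
range	Columns to include in the result, counted from 0, e.g. `'(0 1 3 4)`	:all or list integer	:all
except	The inverse of :range: columns to exclude from the result, counted from 0, e.g. `'(2)`	list integer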

Values

attribute	description	type

Description

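Divides the data. Rows are split according to the ratio given by divide-ratio. The columns can also be restricted with range and except.
After the split, the rows in each part keep the order they had in the original data.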
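
A minimal sketch of a ratio split that preserves row order is given below, in Python. `divide_rows` and its `shuffle` and `seed` parameters are hypothetical, and the way part sizes are rounded is an assumption; the source does not say how CLML handles row counts that do not divide evenly.

```python
import random

def divide_rows(rows, divide_ratio, shuffle=False, seed=None):
    """Split rows by ratio; each part keeps the original row order."""
    total = sum(divide_ratio)
    n = len(rows)
    # Cumulative part boundaries; this rounding rule is an assumption.
    bounds, acc = [], 0
    for r in divide_ratio:
        acc += r
        bounds.append(n * acc // total)
    order = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(order)
    parts, start = [], 0
    for end in bounds:
        # Sorting the chosen indices keeps each part in original row order.
        parts.append([rows[i] for i in sorted(order[start:end])])
        start = end
    return parts

parts = divide_rows(list(range(12)), [1, 2, 3])
shuffled = divide_rows(list(range(12)), [1, 2, 3], shuffle=True, seed=0)
```

Twelve rows with the ratio 1:2:3 give parts of 2, 4 and 6 rows. With `shuffle=True` the membership of each part is random, but the rows inside a part still appear in their original order.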

References

CHOICE-DIMENSIONS

choice-dimensions names data
=> vector vector

Arguments

attribute	description	type	default
names	List of column names	list string
data		unspecialized-dataset or specialized-dataset

Values

attribute	description	type
vector vector

Description

Extracts the data of the columns whose names are given in names.
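
Selecting columns by name can be sketched as below, in Python. This version takes the column names and rows as plain lists instead of a dataset object, and it assumes row-major data and that the output columns follow the order of `names`; neither detail is stated in the source.

```python
def choice_dimensions(names, column_names, rows):
    """Pick the columns called `names`, in the order given, from row-major data."""
    index = {name: i for i, name in enumerate(column_names)}
    cols = [index[name] for name in names]  # raises KeyError for an unknown name
    return [[row[c] for c in cols] for row in rows]

columns = ["id", "height", "weight"]
rows = [[1, 170.0, 60.5], [2, 158.5, 49.0]]
picked = choice_dimensions(["weight", "id"], columns, rows)
```

The result is a sequence of rows, each holding only the requested columns, which corresponds to the `vector vector` return value.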

References

CHOICE-A-DIMENSION

choice-a-dimension name data
=> vector

Arguments

attribute	description	type	default
name	Column name	string
data		unspecialized-dataset or specialized-dataset

Values

attribute	description	type
vector

Description

Extracts the data of the column whose name is given in name.

References

MAKE-UNSPECIALIZED-DATASET

make-unspecialized-dataset all-column-names data => unspecialized-dataset

Arguments

attribute	description	type	default
all-column-names		list string
data		vector vector

Values

attribute	description	type
		unspecialized-dataset

Description

References

DATASET-CLEANING

dataset-cleaning dataset &key interp-types interp-values-alist outlier-types outlier-values => numeric-dataset or category-dataset or numeric-and-category-dataset

Arguments

attribute	description	type	default
dataset		numeric-dataset or category-dataset or numeric-and-category-dataset
interp-types
interp-values-alist		a-list (key: column name, datum: interpolation method (:zero :min :max :mean :median :mode :spline)) or nil
outlier-types		a-list (key: column name, datum: outlier detection method (:std-dev :mean-dev :user :smirnov-grubbs :freq)) or nil
outlier-values		a-list (key: outlier detection method, datum: value corresponding to that method) or nil

Values

attribute	description	type
		numeric-dataset
		category-dataset
		numeric-and-category-dataset

Description

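Performs outlier detection and missing-value interpolation, in that order.

Outlier detection

Each column that appears as a key in outlier-types-alist is checked for outliers using the method given as the datum.

Values judged to be outliers are replaced with missing values. If outlier-types-alist is nil, no outlier detection is performed.

outlier-values-alist specifies the parameter for each outlier detection method. Where no parameter is given, the default value applies.

Outlier detection methods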

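Methods for numeric ( :numeric ) columns
- Standard deviation (:std-dev): a value is an outlier if its difference from the mean is greater than n times the standard deviation. n is the parameter; the default is 3.
- Mean deviation (:mean-dev): a value is an outlier if its difference from the mean is greater than n times the mean deviation. n is the parameter; the default is 3.
- Smirnov-Grubbs test (:smirnov-grubbs): the parameter specifies the significance level. The default is 0.05.
- User specified (:user): the values given as the parameter are treated as outliers. The parameter must always be specified.

Methods for category ( :category ) columns
- Frequency (:freq): the threshold is the total number of data points multiplied by a value (the parameter), and values that occur less often than the threshold are outliers. The default parameter is 0.01.
- User specified (:user): the values given as the parameter are treated as outliers. The parameter must always be specified.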
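
The threshold rules for :std-dev, :mean-dev and :freq can be sketched as follows, in Python. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which estimator CLML uses. The Smirnov-Grubbs test is left out because it needs critical values from a t distribution.

```python
from collections import Counter
from statistics import mean, pstdev

def std_dev_outliers(values, n=3):
    """Flag values more than n standard deviations away from the mean."""
    m, s = mean(values), pstdev(values)  # population std-dev (an assumption)
    return [abs(v - m) > n * s for v in values]

def mean_dev_outliers(values, n=3):
    """Flag values more than n mean deviations away from the mean."""
    m = mean(values)
    md = mean([abs(v - m) for v in values])
    return [abs(v - m) > n * md for v in values]

def freq_outliers(values, ratio=0.01):
    """Flag category values that occur fewer than len(values) * ratio times."""
    threshold = len(values) * ratio
    counts = Counter(values)
    return [counts[v] < threshold for v in values]

numbers = [10.0] * 20 + [100.0]
labels = ["a"] * 150 + ["b"] * 49 + ["c"]
flags = std_dev_outliers(numbers)
# Outliers are replaced with a missing-value marker (None here, an assumption).
cleaned = [None if flag else v for v, flag in zip(numbers, flags)]
```

In this example only the single value 100.0 exceeds three standard deviations, and in `labels` only `"c"` falls below the frequency threshold of 200 * 0.01 = 2 occurrences.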

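Missing-value interpolation

For each column that appears as a key in interp-types-alist, missing values are interpolated using the method given as the datum.

If interp-types-alist is nil, no interpolation is performed.

Interpolation methods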

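Methods for numeric ( :numeric ) columns
- Zero (:zero): interpolate with 0.
- Minimum (:min): interpolate with the minimum value.
- Maximum (:max): interpolate with the maximum value.
- Mean (:mean): interpolate with the mean value.
- Median (:median): interpolate with the median value.
- Cubic spline (:spline): perform cubic spline interpolation. reference: William H. Press "NUMERICAL RECIPES in C", Chapter 3

Methods for category ( :category ) columns
- Mode (:mode): interpolate with the most frequent value.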
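
The constant-fill methods can be sketched in Python as below. `fill_missing` and the `FILLERS` table are hypothetical, `None` stands in for a missing value, and the statistic is computed over the non-missing values only, which is an assumption. Cubic spline interpolation is omitted for brevity.

```python
from statistics import mean, median, mode

# Each method maps the non-missing values of a column to one fill value.
FILLERS = {
    ":zero": lambda present: 0,
    ":min": min,
    ":max": max,
    ":mean": mean,
    ":median": median,
    ":mode": mode,
}

def fill_missing(values, method):
    """Replace every None in values with the statistic chosen by method."""
    present = [v for v in values if v is not None]
    fill = FILLERS[method](present)
    return [fill if v is None else v for v in values]

by_mean = fill_missing([1.0, None, 3.0, 8.0], ":mean")
by_median = fill_missing([1.0, None, 3.0, 8.0], ":median")
by_mode = fill_missing(["a", "b", None, "a"], ":mode")
```

The numeric column is filled with 4.0 under :mean and 3.0 under :median, and the category column is filled with its most frequent value, `"a"`.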

References

MAKE-BOOTSTRAP-SAMPLE-DATASETS

make-bootstrap-sample-datasets dataset &key number-of-datasets => list unspecialized-dataset or numeric-dataset or category-dataset or numeric-and-category-dataset

Arguments

attribute	description	type	default
dataset		unspecialized-dataset or numeric-dataset or category-dataset or numeric-and-category-dataset
number-of-datasets		positive-integer	10

Values

attribute	description	type
		list unspecialized-dataset or numeric-dataset or category-dataset or numeric-and-category-dataset

Description

Creates the number of bootstrap sample datasets given by number-of-datasets.

reference: C. M. Bishop, "Pattern Recognition and Machine Learning" (Japanese edition, vol. 1), p. 22
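
Bootstrap sampling can be sketched in Python as follows. `make_bootstrap_samples` and its `seed` parameter are hypothetical, and giving each sample the same number of rows as the original is the standard bootstrap convention, assumed here and not confirmed by the source.

```python
import random

def make_bootstrap_samples(rows, number_of_datasets=10, seed=None):
    """Draw datasets of len(rows) rows each, sampling rows with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    return [
        [rows[rng.randrange(n)] for _ in range(n)]
        for _ in range(number_of_datasets)
    ]

rows = [[1, "a"], [2, "b"], [3, "c"], [4, "d"]]
samples = make_bootstrap_samples(rows, seed=0)
```

Because rows are drawn with replacement, a given sample usually contains some rows more than once and omits others.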