Thursday, July 23, 2015

Statistical Methods: Course Syllabus
Week    Progress Description
1       Introduction
2       Probability (Ⅰ)
3       Probability (Ⅱ)
4       Sampling and Sampling Distribution
5       Interval Estimation
6       Midterm exam (in the computer laboratory)
7       Hypothesis Testing (Ⅰ)
8       Hypothesis Testing (Ⅱ)
9       Test of Independence and Goodness of Fit
10      ANOVA and Experimental Design (Ⅰ)
11      ANOVA and Experimental Design (Ⅱ)
12      ANOVA and Experimental Design (Ⅲ)
13      Midterm exam (in the computer laboratory)
14      Regression Analysis (Ⅰ)
15      Regression Analysis (Ⅱ)
16      Regression Analysis (Ⅲ)
17      Regression Analysis (Ⅳ)
18      Final exam (in the computer laboratory)
※ The instructor may adjust the weekly schedule above as the class progresses.

Friday, May 15, 2015

Data Mining Literature

Journals and Conferences

IEEE Transactions on Knowledge and Data Engineering (TKDE)
Journal of Data Mining and Knowledge Discovery (JDMKD)
Journal of Very Large Database Systems (JVLDS)
Journal of Visual Language and Computing (JVLC)
Journal of Intelligent Information Systems (JIIS)
Journal of Intelligent Data Analysis (JIDA)
Data and Knowledge Engineering (DKE)
Machine Learning (ML)
ACM SIGMOD Record (SIGMODR)
ACM Int’l Conf. on Management of Data (ICMOD)
IEEE Int’l Conf. on Data Engineering (ICDE)
IEEE Int’l Conf. on Information Visualization (ICIV)
Int’l Conf. on Knowledge Discovery and Data Mining (ICKDD)
Int’l Conf. on Very Large Databases (ICVLDB)
Int’l Conf. on Information and Knowledge Management (CIKM)
Int’l Symp. on Methodologies for Intelligent Systems (ISMIS)
Conference on Machine Learning

Online Resources

• http://www.acm.org/sigkdd/
• http://www.lib.iastate.edu/
Iowa State University has made a systematic effort to identify and acquire the more important monographs and conference proceedings on Data Mining and Knowledge Discovery in Databases. Select 'Library Catalog' and search for 'data mining' or 'knowledge discovery' using the keyword [General Keyword] search.
The data mining group offers many resources.
• The Collection of Computer Science Bibliographies
Lewis, D., 1997. Reuters-21578, Distribution 1.0
• http://www.ulb.ac.be/di/bookmarks/book.html#cs

Overview

Fayyad, U., “From data mining to knowledge discovery: an overview”, Advances in KDD
Brachman, R., “Mining business databases”, Communication of ACM, Nov. 1996
Simoudis, E., “Reality check for data mining”, IEEE EXPERT, Oct. 1996
Fayyad, U., “Knowledge discovery and data mining: towards a unifying framework”, KDD96
Piatetsky-Shapiro, G., “An overview of issues in developing industrial data mining and knowledge discovery applications”, ICKDD 96
Mitchell, T., Machine Learning, McGraw-Hill, 1997

Data Mining Applications

John, G., “Stock selection using rule induction”, IEEE EXPERT, Oct. 1996
Dao, S., “Applying a data miner to heterogeneous schema integration”, KDD95
Dzeroski, S., “Knowledge discovery in a water quality database”, KDD95
Ezawa, K., “Knowledge discovery in telecommunication services data using Bayesian network models”, KDD95
Feelders, A., “Data mining for loan evaluation at ABN AMRO: a case study”, KDD95
Sanjeev, A., “Discovering enrollment knowledge in university databases”, KDD95
Tsumoto, S., “Automated discovery of functional components of proteins from Amino-Acid sequences based on rough sets and change of representation”, KDD95
Fitzsimons, M., “The application of rule induction and neural networks for television audience prediction”, Proc. of ESOMAR/EMAC/AFM symposium on information based decision making in marketing, 1993, pp.69-82
Schmitz, J., “CoverStory – automated news finding in marketing”, DSS Transactions, ed. L. Volino, 46-54. Providence, R.I.: Institute of Management Sciences
Anand, T., “Opportunity explorer: navigating large databases using knowledge discovery templates”, JIIS 4(1): 27-38
Hall, J., “Applying computational intelligence to the investment process”, Proc. of CIFER-96: computational intelligence in financial engineering, IEEE Press
Senator, T., “The financial crimes enforcement network AI system (FAIS)”, AI magazine, winter 1995, 21-39
Davis, A., “Management of cellular fraud: knowledge-based detection, classification and prevention”, Proc. of 13th Int. Conf. on AI, expert systems and natural language, v2, p.155-164
Data mining applications section in KDD96

Web Data Mining

Carbonell, J., “Learning from the WEB”, ISMIS 97
Chen, M.-S., “Data mining for path traversal patterns in a web environment”, Int’l Conf. on Distributed Computing Systems, 1996 (COMPENDEX 91~)
Etzioni, O., “The World-Wide Web: quagmire or gold mine?”, CACM, v.39, no.11, 1996
Hsu, Y.-J. and Wen-Tau Yih (of Taiwan U.), “Template-based information mining from HTML documents”, Proc. of 14th National Conf. on A.I., 1997
Soderland, S. “Learning to extract text-based information from the world wide web”, ICKDD 97
Zaiane, O., “Resource and knowledge discovery in global information systems: a preliminary design and experiment”, KDD95
Zamir, O., “Fast and intuitive clustering of web documents”, ICKDD 97
IPO Keywords: world wide web AND information retrieval

Document Data Mining

Soderland, S. “Learning to extract text-based information from the world wide web”, ICKDD 97
Hahn, U., “Deep knowledge mining from natural language text sources”, CIKM 97
Feldman, R., “Knowledge discovery in textual databases”, KDD95
Feldman, R., “Mining associations in text in the presence of background knowledge”, KDD96
Feldman, R., “Document explorer: discovering knowledge in document collections”, ISMIS 97
Zarri, G., “Conceptual modeling of the ‘meaning’ of textual narrative documents”, ISMIS 97
Esposito, F., “Knowledge revision for document understanding”, ISMIS 97
Reuters-22173 corpus: a collection of 22,173 indexed documents appearing on the Reuters newswire in 1987; Reuters Ltd, Carnegie Group, David Lewis, Information Retrieval Laboratory at the University of Massachusetts; available via ftp from: ciir-ftp.cs.umass.edu:/pub/reuters1/corpus.tar.Z.
Lee-Feng Chien, Chinese Information Processing Laboratory, Institute of Information Science, Academia Sinica: Csmart system

Multimedia Database Mining

Ester, M., “A database interface for clustering in large spatial databases”, KDD95
Li, C., “Knowledge-based scientific discovery in geological databases”, KDD95
Stolorz, P., “Fast spatio-temporal data mining of large geophysical datasets”, KDD95
Knorr, E., “Extraction of spatial proximity patterns by concept generalization”, KDD96
Padmanabhan, B., “Pattern discovery in temporal databases: a temporal logic approach”, KDD96
Czyzewski, A., “Mining knowledge in noisy audio data”, KDD96
Ester, M., “A density-based algorithm for discovering clusters in large spatial databases with noise”, KDD96
Kaufman, K., “A method for reasoning with structured and continuous attributes in the INLEN-2 multistrategy knowledge discovery system”, KDD96
Lagus, K., “Self-organizing maps of document collections: a new approach to interactive exploration”, KDD96

Association Rules

Holsheimer, M., “A perspective on databases and data mining”, KDD95
Feldman, R., “Mining associations in text in the presence of background knowledge”, KDD96
Cheung, D., “Maintenance of discovered knowledge: a case in multi-level association rules”, KDD96
Agrawal, R., “Mining association rules between sets of items in large databases”, ICMOD 1993
Agrawal, R., “Fast algorithms for mining association rules”, ICVLDB 94
Savasere, A., “An efficient algorithm for mining association rules in large databases”, ICVLDB 95
Srikant, R., “Mining quantitative association rules in large relational tables”, ICMOD 96
Fukuda, T., “Data mining using two-dimensional optimized association rules: scheme, algorithms, and visualization”, ICMOD 96
Brin, S., “Dynamic itemset counting and implication rules for market basket data”, ICMOD 97
Brin, S., “Beyond market baskets: generalizing association rules to correlations”, ICMOD 97
Han, E.-H., “Scalable parallel data mining for association rules”, ICMOD 97
Lent, B., “Clustering association rules”, ICDE 97
Park, J., “Mining association rules with adjustable accuracy”, CIKM 97
Singh, L., “Generating association rules from semi-structured documents using a concept hierarchy”, CIKM 97

Time Series

Mannila, H., “Discovering frequent episodes in sequences”, KDD95
Mannila, H., “Discovering generalized episodes using minimal occurrences”, KDD96
Mannila, H., “Rule discovery from time series”, KDD98
Agrawal, R., “Efficient Similarity Search in Sequence Databases”, 4th Int’l Conf. on Foundations of Data Organization and Algorithms, 1993
Agrawal, R., “Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases”, 21st Int’l Conf. on VLDB, 1995
Agrawal, R., “Mining Sequential Patterns”, Int’l Conf. on Data Engineering, 1995
Agrawal, R., “Querying shapes of histories”, VLDB95 Proc.
Berndt, D., “Finding patterns in time series: a dynamic programming approach”, Advances in KDD, 1996
Goldin, D., “On similarity queries for time-series data: constraint specification and implementation”, 1st int’l conf. on the principles and practice of constraint programming, LNCS 976, Sept. 1995
Jagadish, H., “Similarity-based queries”, PODS95 Proc.
Keogh, E., “A probabilistic approach to fast pattern matching in time series databases”, KDD97
Laird, P., “Identifying and using patterns in sequential data”, 4th Int’l Workshop on Algorithmic Learning Theory, 1993, Springer-Verlag, pp.1-18
Lent, B., “Discovering trends in text databases”, KDD97 Proc.
Rafiei, D., “Similarity-based queries for time series data”, SIGMOD97 Proc.
Shim, K., “High-dimensional Similarity Joins”, 13th Int’l Conf. on Data Engineering, 1997
Srikant, R., “Mining Sequential Patterns: Generalizations and Performance Improvements”, Fifth Int’l Conf. on Extending Database Technology, 1996

Visualization and Data Exploration

Brunk, C., “MineSet: an integrated system for data mining”, ICKDD 97
Catarci, T., “Visual query systems for databases: a survey”, JVLC 97
Derthick, M., “An interactive visualization environment for data exploration”, ICKDD 97
Feldman, R., “Visualization techniques to explore data mining results for document collections”, ICKDD 97
Gebhardt, M., “A toolkit for negotiation support interfaces to multi-dimensional data”, ICMOD97
Lee, H.-Y., “Visualization support for data mining”, IEEE EXPERT, Oct. 1996
Livny, M., “DEVise: integrated querying and visual exploration of large datasets”, ICMOD97
Mihalisin, T., “Fast robust visual data mining”, ICKDD97
Rao, S., “Providing better support for a class of decision support queries”, ICMOD96
Roth, S., “Visage: a user interface environment for exploring information”, ICIV 96
Selfridge, P., “IDEA: interactive data exploration and analysis”, ICMOD 96
Ahlberg, C., “Spotfire: an information exploration environment”, SIGMODR v25 n4, Dec. 96
Kennedy, J., “A framework for information visualization”, SIGMODR v25 n4, Dec. 96
Keim, D. “Pixel-oriented database visualizations”, SIGMODR v25 n4, Dec. 96
Ioannidis, Y., “Dynamic information visualization”, SIGMODR v25 n4, Dec. 96
Hasan, M., “Applying database visualization to the world wide web”, SIGMODR v25 n4, Dec. 96

OLAP, Data Cube, and Data Warehousing

Chaudhuri, S., “An overview of data warehousing and OLAP technology”, SIGMODR, March, 97
Colliat, G., “OLAP, relational, and multidimensional database systems”, SIGMODR, Sept. 1996
Gray, J., “Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals”, JDMKD 97
Harinarayan, V., “Implementing data cubes efficiently”, ICMOD 96
Ho, C.-T., “Range queries in OLAP data cubes”, ICMOD 97
Roussopoulos, N., “Cubetree: organization of and bulk updates on the data cube”, ICMOD97
Mumick, I., “Maintenance of data cubes and summary tables in a warehouse”, ICMOD97
Agrawal, R., “Modeling multidimensional databases”, ICDE 97
Gupta, H., “Index selection for OLAP”, ICDE 97
Labio, W., “Physical database design for data warehouses”, ICDE 97
Gyssens, M., “A foundation for multi-dimensional databases”, ICVLDB 97
Ross, K., “Fast computation of sparse datacubes”, ICVLDB 97

Clustering


Similarity

Weber, R. et al., A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces, Int’l Conf. on VLDB, 1998.

Measures of interestingness

Kamber, M., “Evaluating the interestingness of characteristic rules”, KDD96
Silberschatz, A., “On subjective measures of interestingness in knowledge discovery”, KDD95
Suzuki, E., “Exceptional knowledge discovery in database based on information theory”, KDD96

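Knowledge Representation and Data Mining

Textbooks on artificial intelligence and expert systems, or monographs on knowledge representation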
Aronis, J., “Exploiting background knowledge in automated discovery”, KDD96

Feature Selection

Kohavi, R., “Feature subset selection using the wrapper method: overfitting and dynamic search space topology”, KDD95
Seshadri, V., “Feature extraction for massive data mining”, KDD95
Cherkauer, K., “Growing simpler decision trees to facilitate knowledge discovery”, KDD96

Urpani, D., “RITIO - rule induction two in one”, KDD96

Saturday, April 11, 2015

Introduction to Ensemble Learning

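(Reposted)
“Supervised learning” is the simplest and most easily understood form of statistical learning.
In general, supervised learning can be defined formally as follows.
Each data point consists of a feature vector, denoted $X$, and a class label $y$; we also assume that an unknown underlying function $f$ exists such that $y = f(X)$ holds for every training data point $(X, y)$.
The goal of the learning algorithm is then to find a satisfactory approximating function $h$ such that, for any new feature vector $X_{new}$, the class label $h(X_{new})$ it produces is as close as possible to the result $f(X_{new})$ computed by the original function.
This approximating function $h$ is called a classifier, so named because it assigns each input feature vector to a true, or nearly true, class.
Supervised learning can be applied to many problems, including handwriting recognition, medical diagnosis, and part-of-speech tagging in speech or text processing... Full text: Introduction to Ensemble Learning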
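
As a rough illustration of the definition above, the sketch below builds a classifier h from training pairs (X, y) and applies it to a new feature vector. The 1-nearest-neighbor rule, the function name `train_nearest_neighbor`, and the toy data are all assumptions chosen for brevity; the reposted article does not prescribe any particular learning algorithm.

```python
import math


def train_nearest_neighbor(training_points):
    """Return a classifier h built from (X, y) training pairs.

    h labels a new feature vector with the class label of the closest
    training vector (1-nearest-neighbor, an assumed stand-in for any
    learning algorithm that approximates the unknown function f).
    """
    points = list(training_points)
    if not points:
        raise ValueError("at least one training point is required")

    def h(x_new):
        # Pick the training pair whose feature vector is nearest to x_new.
        _, nearest_label = min(points, key=lambda pair: math.dist(pair[0], x_new))
        return nearest_label

    return h


# Hypothetical toy data: two feature dimensions, two class labels.
training = [
    ((0.0, 0.0), "A"),
    ((0.2, 0.1), "A"),
    ((1.0, 1.0), "B"),
    ((0.9, 0.8), "B"),
]
h = train_nearest_neighbor(training)
print(h((0.1, 0.2)))
print(h((0.8, 1.1)))
```

The closer h's answers are to those of the underlying function f on unseen vectors, the better the classifier in the sense described above.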

Clustering

(Reposted)
Journal of Science and Engineering Technology, Vol. 3, No. 1, 2007
A Two-Stage Document Clustering Method Based on Co-occurring Terms
http://journal.dyu.edu.tw/dyujo/document/setjournal/s3-1-9-18.pdf

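Abstract
Clustering is a widely used technique in data mining, and applying the idea to text mining has recently become a popular research topic.
When clustering is applied to document data, the vector space model (VSM) is commonly used to represent the documents, but research has identified two shortcomings. The first is that the model cannot recognize relationships between the terms in a text, which causes documents to be misjudged.
In the vector space model, the dimension formed by each keyword is independent, so relationships between terms (including polysemy, synonymy, and co-occurring terms) cannot be distinguished. Comparisons of document similarity may therefore be misjudged, which lowers the quality of the document clusters.
The second shortcoming is that clustering tends to become inaccurate when the dimensionality is too high.
The dimensionality of the vector space model is determined by the number of keywords in the whole document collection; when too many keywords are extracted from the documents, the dimensionality grows and the clustering results become less accurate.
To remedy these two shortcomings, this paper proposes a two-stage document clustering method: the first stage clusters the keywords, and the second stage uses these keyword clusters to cluster the documents. Association rule techniques are applied to overcome the weaknesses of the vector space model and improve the quality of the document clusters.
In addition, the keyword clusters can help give a general description of each document cluster. The method is evaluated experimentally on the Reuters-21578 collection and compared with traditional document clustering methods; the results confirm that the proposed method does produce high-quality document clusters.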
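
The first shortcoming described in the abstract can be seen in a small sketch. The code below represents documents as term-count vectors and compares them with cosine similarity, a common choice for the vector space model. The helper names and the three toy documents are assumptions made for illustration; this is not the paper's method, only a demonstration of why independent term dimensions miss synonyms.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a document as term counts, one independent dimension per term."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two term-count vectors."""
    shared_terms = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[term] * vec_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: the first two are about the same topic but share no terms.
doc_car = term_vector("car engine repair")
doc_auto = term_vector("automobile motor maintenance")
doc_noise = term_vector("car engine noise")

print(cosine_similarity(doc_car, doc_auto))   # synonyms only: no shared dimension
print(cosine_similarity(doc_car, doc_noise))  # two of three terms shared
```

Because "car" and "automobile" occupy unrelated dimensions, the first pair scores zero even though a reader would group the two documents together, which is the kind of misjudgment the keyword-clustering stage is meant to reduce.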

What Is a Confidence Interval?

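In statistics, a confidence interval for a probability sample is an interval estimate of some population parameter of that sample. A confidence interval shows the extent to which the true value of the parameter falls, with a certain probability, around the measured result. It expresses how credible the measured value of the parameter is, that is, the “certain probability” mentioned above. This probability is called the confidence level. For example, if a candidate's support in an election is 55% and the confidence interval at the 0.95 confidence level is (50%, 60%), then, loosely speaking, his true support lies between 50% and 60% with 95% confidence (strictly, 95% of intervals constructed this way over repeated samples would contain the true value), so the chance that his true support is below one half is less than 2.5% (assuming the distribution is symmetric).
As in this example, the confidence level is usually expressed as a percentage, so a confidence interval at the 0.95 confidence level can also be written as a 95% confidence interval. The two ends of a confidence interval are called the confidence limits. For an estimate in a given situation, the higher the confidence level, the wider the corresponding confidence interval.
Computing a confidence interval usually requires assumptions about the estimation process (which makes it part of parametric statistics), for example the assumption that the estimation error is normally distributed.
Confidence intervals are used only in frequentist statistics. The corresponding concept in Bayesian statistics is the credible interval. Credible intervals and confidence intervals rest on different conceptual foundations, however, so in general their values will not coincide. The confidence interval is the range, obtained by calculation, in which the estimate lies. The confidence level is the probability that the exact value falls within this range. A confidence interval is a concrete range of values; a confidence level is a probability. For example, suppose a task is estimated to finish between the 10th and the 12th, but the estimate is only about 80% reliable: the confidence interval is (10, 12) and the confidence level is 80%. To raise the confidence level, the interval must be widened.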
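
The election example can be reproduced approximately with a short sketch. It uses the normal approximation for a sample proportion, which matches the normality assumption mentioned above. The function name and the sample size of 380 respondents are assumptions (the text gives no sample size); 380 is simply a value for which the 95% interval around 55% comes out close to (50%, 60%).

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(p_hat, n, confidence=0.95):
    """Normal-approximation confidence interval for a sample proportion."""
    if n <= 0 or not 0 < confidence < 1:
        raise ValueError("n must be positive and confidence must be in (0, 1)")
    # Two-sided critical value, e.g. about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical poll: 55% support among 380 respondents (sample size assumed).
low95, high95 = proportion_confidence_interval(0.55, 380, confidence=0.95)
low99, high99 = proportion_confidence_interval(0.55, 380, confidence=0.99)
print(f"95% interval: ({low95:.3f}, {high95:.3f})")
print(f"99% interval: ({low99:.3f}, {high99:.3f})")
```

Comparing the two printed intervals shows the trade-off stated in the text: raising the confidence level from 95% to 99% widens the interval.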

Confidence Level (excerpted from Baidu Baike)

(Reposted) Confidence level
http://translate.google.com.tw/translate?hl=zh-TW&sl=zh-CN&u=http://baike.baidu.com/view/434404.htm&prev=search
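In statistics, a confidence interval for a probability sample is an interval estimate of some population parameter of that sample. A confidence interval shows the extent to which the true value of the parameter falls, with a certain probability, around the measured result. It expresses how credible the measured value of the parameter is, that is, the “certain probability” mentioned above. This probability is called the confidence level.

[Introduction]
If a candidate's support in an election is 55% and the confidence interval at the 0.95 confidence level is (50%, 60%), then, loosely speaking, his true support lies between 50% and 60% with 95% confidence (strictly, 95% of intervals constructed this way over repeated samples would contain the true value), so the chance that his true support is below one half is less than 2.5% (assuming the distribution is symmetric).
As in this example, the confidence level is usually expressed as a percentage, so a confidence interval at the 0.95 confidence level can also be written as a 95% confidence interval. The two ends of a confidence interval are called the confidence limits. For an estimate in a given situation, the higher the confidence level, the wider the corresponding confidence interval.
Computing a confidence interval usually requires assumptions about the estimation process (which makes it part of parametric statistics), for example the assumption that the estimation error is normally distributed.
Confidence intervals are used only in frequentist statistics. The corresponding concept in Bayesian statistics is the credible interval.
Credible intervals and confidence intervals rest on different conceptual foundations, however, so in general their values will not coincide.
The confidence interval is the range, obtained by calculation, in which the estimate lies.
The confidence level is the probability that the exact value falls within this range.
A confidence interval is a concrete range of values; a confidence level is a probability.
For example, suppose a task is estimated to finish between the 10th and the 12th, but the estimate is only about 80% reliable:
the confidence interval is (10, 12) and the confidence level is 80%. To raise the confidence level, the interval must be widened.
The confidence level is the probability that the population parameter falls within a certain region around the sample statistic; the confidence interval is the range of error between the sample statistic and the population parameter at a given confidence level.
The wider the confidence interval, the higher the confidence level.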
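
The link between interval width and confidence level can be checked with a small simulation. The sketch below repeatedly draws samples from a normal population with a known mean, builds an interval around each sample mean, and counts how often the interval contains the true mean. All parameter values (mean 50, standard deviation 10, 30 observations, 2000 trials, fixed seed) and the function name are assumptions picked for illustration, and the standard deviation is treated as known to keep the example short.

```python
import math
import random
from statistics import NormalDist, fmean


def coverage_rate(true_mean=50.0, sigma=10.0, n=30, confidence=0.95,
                  trials=2000, seed=0):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / math.sqrt(n)  # half-width of each interval
    hits = 0
    for _ in range(trials):
        sample_mean = fmean(rng.gauss(true_mean, sigma) for _ in range(n))
        if sample_mean - margin <= true_mean <= sample_mean + margin:
            hits += 1
    return hits / trials


rate80 = coverage_rate(confidence=0.80)
rate95 = coverage_rate(confidence=0.95)
print(f"80% intervals covered the true mean in {rate80:.1%} of trials")
print(f"95% intervals covered the true mean in {rate95:.1%} of trials")
```

The wider 95% intervals should cover the true mean more often than the narrower 80% intervals, with each coverage rate landing near its nominal confidence level.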

Monday, April 6, 2015

Change Your Password After the First Login

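School e-mail address
Notes:
The school e-mail account is the channel the university uses to contact students about important matters. After completing registration, log in through the “成功入口” portal on the university home page to send and receive mail with your school e-mail account. For problems using the “成功入口” system, ask at the help desk on the first floor of the Computer and Network Center.

New students (admitted in academic year 98):
(Some new students can log in only after they have completed registration and their student records have been entered.)
Local students: last 4 digits of [National ID number] + last 4 digits of [date of birth].
Example:
[National ID number]: A123456789
[Date of birth]: 780612
======>> The first-login password is then: 67890612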
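
The rule for local students can be written out as a short sketch. The function name `initial_password` is hypothetical, and the code only mirrors the rule and the sample values quoted above; it is not part of the school's system.

```python
def initial_password(national_id, birthday):
    """Join the last 4 characters of the ID number and of the birthday string."""
    national_id = national_id.strip()
    birthday = birthday.strip()
    if len(national_id) < 4 or len(birthday) < 4:
        raise ValueError("both values need at least 4 characters")
    return national_id[-4:] + birthday[-4:]


# Sample values taken from the example in the post.
print(initial_password("A123456789", "780612"))  # 67890612
```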