<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-6054902935917765633</id><updated>2011-07-30T15:52:15.176-07:00</updated><category term='unsupervised learning'/><category term='nlp'/><category term='this blog'/><category term='graphical model'/><category term='structured prediction'/><category term='semi-supervised learning'/><category term='cognitive science'/><category term='list'/><category term='application'/><category term='supervised learning'/><title type='text'>Apprentice of Artificial Intelligence</title><subtitle type='html'>What I learn in machine learning, artificial intelligence, and other related areas.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>25</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-8843591130711276782</id><published>2011-05-19T02:00:00.000-07:00</published><updated>2011-05-19T02:40:35.406-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='graphical model'/><title type='text'>Feature selection methods not so useful for graphical model structure learning</title><content type='html'>As discussed in one of my previous posts (&lt;a href="http://ai-ml.blogspot.com/2008/10/feature-selection-and-graphical-model.html"&gt;here&lt;/a&gt;), feature selection and graphical model structure learning are highly related. On one hand, variables in the Markov blanket of a target variable can be used as a compact and effective feature set for predicting the value of the target variable; on the other hand, traditional feature selection methods can used to shed light on the local structures around a target variable (see &lt;a href="http://ai-ml.blogspot.com/2010/08/graphical-model-structure-learning.html"&gt;another post&lt;/a&gt; of mine). However, the paper "Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification" (&lt;a href="http://jmlr.csail.mit.edu/papers/v11/aliferis10a.html"&gt;Part I&lt;/a&gt;, &lt;a href="http://jmlr.csail.mit.edu/papers/v11/aliferis10b.html"&gt;Part II&lt;/a&gt;) provides extensive experimental results that support the first practice but question the second practice. The authors found that traditional feature selection methods tend to return variables that are highly predictive but are often not in the Markov blanket of the target variable and therefore not useful in structure learning.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-8843591130711276782?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/8843591130711276782/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=8843591130711276782' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/8843591130711276782'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/8843591130711276782'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2011/05/feature-selection-methods-not-so-useful.html' title='Feature selection methods not so useful for graphical model structure learning'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-2501682389601435767</id><published>2011-02-13T22:15:00.000-08:00</published><updated>2011-02-13T22:51:33.821-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='cognitive science'/><title type='text'>An ET point of view of neuroscience</title><content type='html'>How would an extraterrestrial view the state of the art of our neuroscience research? The paper "&lt;a href="http://www.ncbi.nlm.nih.gov/pubmed/20547092"&gt;Neuroscience and the correct level of explanation for understanding mind: An extraterrestrial roams through some neuroscience laboratories and concludes earthlings are not grasping how best to understand the mind–brain interface&lt;/a&gt;" (a really long title...) in Trends in Cognitive Sciences (2010 Jul; 14(7)) describes a possible point of view from ET (or more realistically, from a high-level point of view of the big picture). Specifically, the author argues that studying the low-level neural mechanisms, like the behavior of individual neurons, may not be the right way to understand mind, just like we do not need to understand quantum mechanics before understanding the Newtonian mechanics. This reminds me of the long-standing symbolism vs. connectionism debate in AI research. What is the right level of research towards an &lt;a href="http://en.wikipedia.org/wiki/Strong_AI#Artificial_General_Intelligence_research"&gt;artificial general intelligence&lt;/a&gt;? Symbols? neurons? Or somewhere in-between?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-2501682389601435767?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/2501682389601435767/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=2501682389601435767' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/2501682389601435767'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/2501682389601435767'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2011/02/et-point-of-view-of-neuroscience.html' title='An ET point of view of neuroscience'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-7986788277120836575</id><published>2010-09-27T00:09:00.000-07:00</published><updated>2010-09-27T00:10:04.639-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='semi-supervised learning'/><category scheme='http://www.blogger.com/atom/ns#' term='structured prediction'/><title type='text'>Three types of information for learning structured predictor</title><content type='html'>Structured prediction tries to predict the hidden structure of the input. In other words, it's a prediction problem with multiple outputs that are mutually dependent. An example is sentence parsing, where given an input sentence we want to predict the syntactic parse tree of the sentence. In the classic setting (i.e., &lt;span style="font-style: italic;"&gt;supervised structured prediction&lt;/span&gt;), the training data is:&lt;br /&gt;&lt;br /&gt;1. input data with their hidden structures labeled, e.g., sentences with their syntactic parsing&lt;br /&gt;&lt;br /&gt;And of course, we can also have:&lt;br /&gt;&lt;br /&gt;2. input data without structure labeling, e.g., just sentences&lt;br /&gt;&lt;br /&gt;When both type-1 and type-2 info is available, it's called &lt;span style="font-style: italic;"&gt;semi-supervised structured prediction&lt;/span&gt;. When only type-2 info is available, it's an &lt;span style="font-style: italic;"&gt;unsupervised structured prediction&lt;/span&gt; problem, which is usually very hard.&lt;br /&gt;&lt;br /&gt;For many structured prediction problems, there is a related binary classification problem, which distinguishes valid inputs from invalid ones. One can view a valid input as the input having at least one valid structure, and an invalid input as the input having no valid structure. In the field of formal grammar, the binary classification problem is associated with the &lt;a href="http://en.wikipedia.org/wiki/Weak_generative_capacity"&gt;weak generative capacity&lt;/a&gt; of a grammar, and the structured prediction problem is associated with the &lt;a href="http://en.wikipedia.org/wiki/Strong_generative_capacity"&gt;strong generative capacity&lt;/a&gt; of a grammar. So we can have a third type of information:&lt;br /&gt;&lt;br /&gt;3. invalid inputs, e.g., ungrammatical sentences&lt;br /&gt;&lt;br /&gt;The question is, whether this type-3 info can help structured prediction. This is studied in "&lt;a href="http://www.icml2010.org/papers/522.pdf"&gt;Structured Output Learning with Indirect Supervision&lt;/a&gt;" from ICML 2010. They found that it does help, because invalid inputs further restrict the space of correct structures, as illustrated in their Figure.1.&lt;br /&gt;&lt;br /&gt;Now let's turn to the binary classification problem. If both type-2 and type-3 info is available, this is the classic supervised learning; if only type-2 or only type-3 info is available, this is the &lt;span style="font-style: italic;"&gt;one-class classification&lt;/span&gt; problem. So one would wonder, whether type-1 info would help the binary classification? This is also experimented in the same paper, and interestingly, the answer is again yes.&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-7986788277120836575?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/7986788277120836575/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=7986788277120836575' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/7986788277120836575'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/7986788277120836575'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2010/09/three-types-of-information-for-learning.html' title='Three types of information for learning structured predictor'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-3845866368557221497</id><published>2010-09-01T22:30:00.000-07:00</published><updated>2010-09-02T06:42:29.478-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='nlp'/><category scheme='http://www.blogger.com/atom/ns#' term='unsupervised learning'/><title type='text'>Unsupervised Multilingual Grammar Induction</title><content type='html'>Grammar induction is the process of learning a grammar (e.g., context-free grammar) from texts. If there's no grammatical annotation of the texts available to the learner, then it's called unsupervised grammar induction. This sounds like (and actually is) a hard problem, especially for complicated grammars like those of natural languages. But every human actually goes through such a procedure in his early life, although many researchers argue that human have more than plain language input when learning the first language, e.g., perceptual context.&lt;br /&gt;&lt;br /&gt;There's some recent work on multilingual grammar induction, that is, learning from the texts of multiple languages simultaneously, in hope that through the commonality of different languages, one can get more useful info than available from a monolingual corpus of limited size. The commonality is usually represented by the alignment of parallel corpus, i.e., the mapping of words between sentences of the same meaning but different languages. In ACL2010, however, there happen to be two papers on unsupervised multilingual grammar induction without using alignment.&lt;br /&gt;&lt;br /&gt;The first is "&lt;a href="http://www.blogger.com/www.aclweb.org/anthology/P/P10/P10-2034.pdf"&gt;Learning Common Grammar from Multilingual Corpus&lt;/a&gt;", which learns a probabilistic context-free grammar for each language. The assumption is, each grammar has the same set of nonterminals, and for each nonterminal, the probability distribution is different but has the same Dirichlet prior. One problem with this is the different ordering of grammatical components of different languages is not taken into account. Also there's no quantitative evaluation in the paper.&lt;br /&gt;&lt;br /&gt;The second paper, "&lt;a href="http://aclweb.org/anthology-new/P/P10/P10-1131.pdf"&gt;Phylogenetic Grammar Induction&lt;/a&gt;", proposes a method that is more complicated. Here a type of dependency grammar is used. The parameters of grammars of different languages are tied, not by a single common prior, but by a set of priors positioned in a phylogenetic tree. The commonality of different languages is represented by the mapping of their Part-of-Speech tags. They did a more complete evaluation, and found that a more accurate/complete phylogenetic tree resulted in better grammars.&lt;br /&gt;&lt;br /&gt;The success of multilingual grammar induction, especially the result of the 2nd paper, has some very interesting implication. One difficulty of unsupervised grammar induction is that, it usually yields an alternative grammar different from the one we use (see error type 3 in my previous post "&lt;a href="http://ai-ml.blogspot.com/2010/07/errors-of-unsupervised-learning.html"&gt;Errors of Unsupervised Learning&lt;/a&gt;"). From the results of these papers, it seems that the set of alternative grammars of each language is more or less different, but they all contains the grammar we use, so multilingual grammar induction can eliminate those alternatives while find the right one. But why? Because of the mysterious &lt;a href="http://en.wikipedia.org/wiki/Universal_grammar"&gt;universal grammar&lt;/a&gt;?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-3845866368557221497?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/3845866368557221497/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=3845866368557221497' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/3845866368557221497'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/3845866368557221497'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2010/09/unsupervised-multilingual-grammar.html' title='Unsupervised Multilingual Grammar Induction'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-150828607578539160</id><published>2010-08-25T19:57:00.000-07:00</published><updated>2010-08-29T16:38:34.151-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='graphical model'/><title type='text'>Graphical model structure learning using logistic regression</title><content type='html'>It seems to me that, one of the reasons that learning graphical model structure is hard is that it has to be optimized in a discrete space. In addition, when using score based methods for structure learning (see my &lt;a href="http://ai-ml.blogspot.com/2008/10/graphical-model-structure-learning-with.html"&gt;old post&lt;/a&gt; for some intro), one has to do inference over parameters separately from the search of structures, because a parameter set is specific to a structure.&lt;br /&gt;&lt;br /&gt;One strategy that circumvents these problems is to encode the structure in parameters. That is, we assume a fully connected graph, and the parameters will indicate which edges are effective. The prior on structures can now be added into the prior over parameters. For example, L0 or L1 regularization of parameters can enforce structure sparsity.&lt;br /&gt;&lt;br /&gt;One problem with this strategy is that, in the case of discrete variables, a fully connected graph means we have exponentially many parameters. To avoid this problem, one can use micro-models for CPT, e.g., noisy-OR, logistic regression, etc. The L1-regularized logistic regression is a well-studied problem, which can be optimized efficiently.&lt;br /&gt;&lt;br /&gt;It turns out that similar ideas have already been studied in recent years. For example, &lt;a href="www.cs.cmu.edu/%7Epradeepr/papers/graphl1nips06.pdf"&gt;this paper&lt;/a&gt; from NIPS 2006 used L1-regularized logistic regression to learn the structure of discrete Markov network. The case of learning Bayesian network structure is more difficult, because there's no intuitive way to represent edge directions in parameters while avoiding cycles. One solution is to first learn the skeleton of the Bayesian network using the above idea, and then determine the edge directions, just like in constraint based structure learning methods. Another solution is to assume the ordering of variables, which uniquely determines the edge directions. An order search can be carried out to find a good ordering. These two methods are discussed in &lt;a href="http://www.aaai.org/Papers/AAAI/2007/AAAI07-202.pdf"&gt;this paper&lt;/a&gt; from AAAI 2007.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-150828607578539160?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/150828607578539160/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=150828607578539160' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/150828607578539160'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/150828607578539160'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2010/08/graphical-model-structure-learning.html' title='Graphical model structure learning using logistic regression'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-2937373163615827769</id><published>2010-08-06T15:46:00.000-07:00</published><updated>2010-08-08T12:26:37.124-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='application'/><category scheme='http://www.blogger.com/atom/ns#' term='cognitive science'/><title type='text'>Mind Reading</title><content type='html'>Decoding neural activities using machine learning methods is an emerging area since a few years ago. The neural data is usually obtained by presenting the word and/or image of a concept to an experiment participant and recording his brain images (e.g., &lt;a href="http://en.wikipedia.org/wiki/Functional_magnetic_resonance_imaging"&gt;fMRI&lt;/a&gt;, &lt;a href="http://en.wikipedia.org/wiki/Electroencephalography"&gt;EEG&lt;/a&gt;). The types of concepts tested so far are very simple (e.g., concrete nouns, and more recently adjective-noun compositions), but I believe experiments on more complex and abstract concepts are to be expected in the near future (or are already on progress!). Given the neural imaging data, one natural task is to find out the mapping between concepts and images. An intermediate layer of semantic features can be added between concepts and images, which is intuitive and also makes things more tractable. So now the problems are what the right semantic features are, and how to find out the mappings between these layers.&lt;br /&gt;&lt;br /&gt;For the first problem, in earlier work this is somewhat manually constructed. In "&lt;a href="http://www.cs.cmu.edu/%7Etom/pubs/science2008.pdf"&gt;Predicting Human Brain Activity Associated with the Meanings of Nouns&lt;/a&gt;" (Science, May 2008), the semantic features are the co-occurrence frequency of the stimulus noun with 25 manually selected verbs in a large corpus. In a more recent paper, "&lt;a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0008622"&gt;A Neurosemantic Theory of Concrete Noun Representation Based on the Underlying Brain Codes&lt;/a&gt;" (PLoS ONE, Jan 2010), these features are discovered from the fMRI data by means of factor analysis (and the result is very interesting: the three main features are related to manipulation, shelter and eating, all of which are the most important things for the survival of our primitive ancestors). With the semantic features specified, the second problem can be done by simply applying common machine learning predictors like Naive Bayes.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-2937373163615827769?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/2937373163615827769/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=2937373163615827769' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/2937373163615827769'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/2937373163615827769'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2010/08/mind-reading.html' title='Mind Reading'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-5137558716513629240</id><published>2010-07-22T01:24:00.000-07:00</published><updated>2010-07-22T02:26:20.932-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='unsupervised learning'/><title type='text'>Errors of Unsupervised Learning</title><content type='html'>The paper "&lt;a href="http://www.aclweb.org/anthology/P/P08/P08-1100.pdf"&gt;Analyzing the Errors of Unsupervised Learning&lt;/a&gt;" (ACL08) decomposes the errors of unsupervised learning into four types.&lt;br /&gt;&lt;br /&gt;1. Optimization error. In most cases we cannot find the global optimum of the objective function, and we resort to methods like EM or gradient descent to obtain a local optimum. This seems to me the most recognized error.&lt;br /&gt;&lt;br /&gt;2. Estimation error. We only have a finite training set , from which we cannot really uncover the true distribution. Obviously, this error can be reduced by increasing the size of the training set. The paper also found that, larger training set can help reduce the 1st error as well, possibly by making the objective function smoother.&lt;br /&gt;&lt;br /&gt;3. Identi fiability error. There are hidden variables (e.g., labels) in unsupervised learning. So there could be a few different models that have the same maximal objective function value (over observables, e.g., likelihood of observables), but they may have different distance to the true distribution (over both observable and hidden variables).&lt;br /&gt;&lt;br /&gt;4. Approximation error. This is the discrepancy between the best model that maximizes the objective function and the true model. In my opinion this can be further decomposed into two types of errors:&lt;br /&gt;4.1. The model family we learn may not include the true model. For example, using hidden Markov model to fit the data sampled from a probabilistic context-free grammar.&lt;br /&gt;4.2. The objective function we use may not find the best model in the model family. For example, if the objective function is the posterior given the data, then different priors would lead to different models, and the better the prior reflects the true model, the better the resulting model is.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-5137558716513629240?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/5137558716513629240/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=5137558716513629240' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/5137558716513629240'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/5137558716513629240'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2010/07/errors-of-unsupervised-learning.html' title='Errors of Unsupervised Learning'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-4405395548636444908</id><published>2010-03-01T22:28:00.000-08:00</published><updated>2010-03-19T23:49:33.663-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='graphical model'/><title type='text'>Measuring Model Complexity</title><content type='html'>In model selection and model averaging, it is important to have a good measurement of model complexity. With two models that fit the data equally well, the simpler model should be preferred (the Occam's razor principle). One reason is that usually the simpler model has better generalizability compared with the more complex one. However, it's not clear what the best way is to measure model complexity.&lt;br /&gt;&lt;br /&gt;The paper "&lt;a href="http://www.blogger.com/books.nips.cc/papers/files/nips22/NIPS2009_1210.pdf"&gt;Measuring Model Complexity with the Prior Predictive&lt;/a&gt;" (NIPS 2009) lists a few aspects of a model that a good complexity measure should take into account.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;the number of parameters&lt;br /&gt;&lt;/li&gt;&lt;li&gt;the functional form (the way the parameters are combined in the model equation)&lt;/li&gt;&lt;li&gt;the parameter range&lt;/li&gt;&lt;li&gt;the prior distribution of the parameters&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;Therefore the complexity measures used in traditional criteria like &lt;a href="http://en.wikipedia.org/wiki/Akaike_information_criterion"&gt;AIC&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Bayesian_information_criterion"&gt;BIC&lt;/a&gt; is not satisfying. This paper proposes a measure called &lt;span style="font-style: italic;"&gt;prior predictive complexity &lt;/span&gt;(PPC). Remember that a model defines a distribution over all possible outcomes (i.e., observables). So PPC is defined to be the range of outcomes that the model puts most of its probability mass on (say, 95%), normalized by the total range of outcomes. Since all four aspects of a model listed above influence the distribution over outcomes, PPC is sensitive to all of them. A natural alternative measure, in my opinion, would be to use the information entropy of the distribution over outcomes defined by the model, so we don't have to artificially specify a threshold like 95%.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-4405395548636444908?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/4405395548636444908/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=4405395548636444908' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/4405395548636444908'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/4405395548636444908'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2010/03/measuring-model-complexity.html' title='Measuring Model Complexity'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-694668404220026679</id><published>2009-12-21T18:24:00.000-08:00</published><updated>2009-12-30T17:08:17.802-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='list'/><title type='text'>NIPS 2009 My List</title><content type='html'>Here is a list of NIPS2009 papers that are interesting to me.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://books.nips.cc/papers/files/nips22/NIPS2009_0929.pdf"&gt;Rethinking LDA: Why Priors Matter&lt;/a&gt;. A second Dirichlet prior is added on top of the Dirichlet prior of the topic distributions in LDA, instead of fixed symmetric hyperparameters. This makes the model much more robust. For efficiency consideration, point estimation can be used on the parameters of the first Dirichlet prior.&lt;/li&gt;&lt;li&gt;&lt;b&gt;&lt;/b&gt;&lt;a href="http://books.nips.cc/papers/files/nips22/NIPS2009_0215.pdf"&gt;Decoupling Sparsity and Smoothness  in the Discrete Hierarchical Dirichlet Process&lt;/a&gt;. A Dirichlet prior (with small parameters) has two effects on the posterior of the discrete distribution (e.g., the topic distribution in LDA): sparsity and smoothness. The two are controlled by the same concentration parameter. To decouple the two, this paper adds a set of Bernoulli variables to the model.&lt;/li&gt;&lt;li&gt;&lt;a href="http://books.nips.cc/papers/files/nips22/NIPS2009_0190.pdf"&gt;Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations&lt;/a&gt;. An application of the beta process (i.e., the Indian buffet process) to computer vision for image dictionary learning, which can be extended for classification, denoising and inpainting, etc.&lt;/li&gt;&lt;li&gt;&lt;a href="http://books.nips.cc/papers/files/nips22/NIPS2009_0639.pdf"&gt;Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning&lt;/a&gt;. Discriminative and generative languages  learning differ in the use of the "implicit negative evidence" i.e., the absence of sentences. Psychological experiments show that human are capable of both, depending on how the task is presented.&lt;/li&gt;&lt;li&gt;&lt;a href="http://books.nips.cc/papers/files/nips22/NIPS2009_1030.pdf"&gt;Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs&lt;/a&gt;. In problems like grammar induction, dynamic programming is usually used to compute expectations, which can be very slow on large data. Pruning can accelerate the computation but introduces bias. Here is a new technique that uses MCMC.&lt;/li&gt;&lt;li&gt;&lt;b&gt;&lt;/b&gt;&lt;a href="http://books.nips.cc/papers/files/nips22/NIPS2009_0125.pdf"&gt;Reading Tea Leaves: How Humans Interpret Topic Models&lt;/a&gt;. A human evaluation of the quality of the topics learned by topic models. They actually used Amazon Mechanical Turk to conduct the evaluation. One interesting result is that the held-out likelihood (or perplexity) may have a negative correlation with the measured quality of the topics.&lt;/li&gt;&lt;li&gt;&lt;a href="http://books.nips.cc/papers/files/nips22/NIPS2009_1109.pdf"&gt;Posterior vs Parameter Sparsity in Latent Variable Models&lt;/a&gt;. A novel kind of sparsity bias that uses the posterior plus a regularization term as the object function, which is optimized by an EM-like algorithm. The regularization term can represent sparsity bias that can't be easily expressed as priors of model parameters (e.g., the Dirichlet prior).&lt;/li&gt;&lt;li&gt;&lt;a href="http://books.nips.cc/papers/files/nips22/NIPS2009_1100.pdf"&gt;An Infinite Factor Model Hierarchy Via a Noisy-Or Mechanism&lt;/a&gt;. An extension of Indian buffet process that adds a second (or more) layer of features, which are connected to the first layer by noisy-or, i.e., a low level feature is the noisy-or of a few high level features. This new model learns more compact features and has better performance in various tasks than IBP.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-694668404220026679?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/694668404220026679/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=694668404220026679' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/694668404220026679'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/694668404220026679'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2009/12/nips-2009-my-list.html' title='NIPS 2009 My List'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-853770345872989660</id><published>2009-09-17T20:22:00.000-07:00</published><updated>2009-09-17T21:13:51.335-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='supervised learning'/><category scheme='http://www.blogger.com/atom/ns#' term='unsupervised learning'/><title type='text'>Curriculum Learning</title><content type='html'>Most machine learning algorithms look at their training data all at once. However, human usually learn in a different, more organized way. We start with simple cases, concepts, or skills, and then gradually increase the difficulty level of the learning. Can this learning strategy also apply to machines? This is what the ICML-09 paper "&lt;a href="http://www.kyb.tuebingen.mpg.de/bs/people/weston/papers/curriculum.pdf"&gt;Curriculum Learning&lt;/a&gt;" tries to address. In their three experiments on toy classification problems, shape recognition and language modeling, curriculum learning was shown to be both faster and more accurate than learning without curriculum. Some important questions regarding curriculum learning are discussed but remain largely open. First of all, why does curriculum learning help (or in some cases, doesn't help)? One speculation is that at the beginning of learning, difficult samples can confuse the learner instead of help. A related question is, how shall we find, either for a specific problem or in general, a good curriculum? In their experiments they used some quite intuitive curriculum; but an intuitive curriculum doesn't always help (for example, see &lt;a href="http://nlpers.blogspot.com/2009/06/icmlcoltuai-2009-retrospective.html"&gt;Hal Daume's comment on this paper in his blog&lt;/a&gt;).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-853770345872989660?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/853770345872989660/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=853770345872989660' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/853770345872989660'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/853770345872989660'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2009/09/curriculum-learning.html' title='Curriculum Learning'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-4775175625480619487</id><published>2009-06-06T07:53:00.001-07:00</published><updated>2009-06-07T05:50:55.522-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='graphical model'/><title type='text'>Convergence of Loopy Belief Propagation</title><content type='html'>I've been using &lt;a href="http://en.wikipedia.org/wiki/Belief_propagation#Approximate_algorithm_for_general_graphs"&gt;loopy belief propagation&lt;/a&gt; (LBP) for inference recently. LBP is exact on tree-structures; but on graphs with cycles it is approximate and may not even converge. Generally speaking, the less cycles the graph contains and more smooth the factors are, the more likely LBP will converge. Unfortunately, the graphical model that I'm working on is a complete graph; even worse, part of the factors defined over the graph encode hard constraints. In addition, I'm using the max-sum version of LBP, which is thought to be more prone to oscillation than the sum-product version. The combined effect of all these was a nightmare of oscillation, even in simple cases.&lt;br /&gt;&lt;br /&gt;There doesn't seem to be much work done on how to help LBP converge, and here is what I've found and tried.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Damping. That is, when updating a message, set it to a weighted average of its old values and new values. This somewhat slows down the pace of LBP and avoids possible overshooting.&lt;/li&gt; &lt;li&gt;Using a more informed message passing schedule. Instead of updating all the messages synchronously or in a fixed order, we can choose the next message according to some criterion. A very simple yet powerful schedule is called "&lt;a href="http://www.robotics.stanford.edu/%7Egalel/papers/ElidanRBP.pdf"&gt;residual belief propagation&lt;/a&gt;" (RBP), which always updates the message that has the most change (residue). An improved (but maybe less general) version of RBP is &lt;a href="http://www.cs.berkeley.edu/%7Ecasutton/publications/rbp0.pdf"&gt;RBP0L&lt;/a&gt;, which estimates the residues instead of actually computing them.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;With damping and RBP, the convergence of LBP on my problem seems much better. There are some other more complicated methods like "tree-reweighted belief propagation", but I haven't looked into them.&lt;br /&gt;&lt;br /&gt;With this experience, I suspect that the convergence problem is one of the reasons that LBP is less used for graphical model inference than MCMC and variational methods. For the latter two, in general we would know in advance whether we can use them or not; but for LBP, in many cases running it seems the only way to know if it works or not.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-4775175625480619487?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/4775175625480619487/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=4775175625480619487' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/4775175625480619487'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/4775175625480619487'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2009/06/convergence-of-loopy-belief-propagation.html' title='Convergence of Loopy Belief Propagation'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-3410609733642251741</id><published>2009-04-15T18:00:00.000-07:00</published><updated>2009-04-15T16:23:34.202-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='supervised learning'/><title type='text'>Discretization for naive Bayes</title><content type='html'>When using naive Bayes, quantitative attributes are usually discretized (binning). Two simple ways of discretization are partitioning the data into k intervals of equal width, or into k intervals so each contains equal number of data points. There are also supervised methods, like the one used in &lt;a href="http://www.cs.waikato.ac.nz/ml/weka/"&gt;Weka&lt;/a&gt; discretizer, which recursively selects the cut-point based on information gain (just like in some decision tree construction algorithms). In the paper "&lt;a href="http://springerlink.com/content/e30q39mt02g47208/?p=47534a26a4c94efcb3685a21f3ef98de&amp;amp;pi=2"&gt;Discretization for naive-Bayes learning: managing discretization bias and variance&lt;/a&gt;" (MLJ, Jan 09), the authors find out that, increasing the number of intervals helps reduce the bias of the learned NB classifier, while the number of data points within each interval (called "interval frequency") shall be large enough to reduce the variance of the learned classifier. Therefore, when we have a fixed size of training data, there's a trade-off between the interval number and frequency. Based on this observation they propose two novel discretization methods. "Proportional discretization" requires the interval number and frequency to be the same; "fixed frequency discretization" simply fixes the interval frequency to a sufficiently large value. Their experiments showed that these two methods outperformed existing discretization techniques.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-3410609733642251741?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/3410609733642251741/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=3410609733642251741' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/3410609733642251741'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/3410609733642251741'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2009/04/discretization-for-naive-bayes.html' title='Discretization for naive Bayes'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-6022948955873579423</id><published>2009-04-09T21:59:00.000-07:00</published><updated>2009-04-09T22:39:07.257-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='unsupervised learning'/><category scheme='http://www.blogger.com/atom/ns#' term='graphical model'/><title type='text'>More on Indian buffet process</title><content type='html'>I had my lunch today in an Indian buffet for the first time. Since I knew nothing of Indian food, I decided to follow the Indian buffet process, as described in &lt;a href="http://ai-ml.blogspot.com/2009/03/indian-buffet-process.html"&gt;my last post&lt;/a&gt;. That is, I chose dishes with probabilities proportional to their popularity, but also chose some random new dishes. The result was good. That said, however, I still don't think I have enough courage to test the &lt;a href="http://en.wikipedia.org/wiki/Chinese_restaurant_process"&gt;Chinese restaurant process&lt;/a&gt; in a Chinese restaurant :)&lt;br /&gt;&lt;br /&gt;Back to mathematics. Remember that the Chinese restaurant process actually produces samples according to a distribution that is sampled from a &lt;a href="http://en.wikipedia.org/wiki/Dirichlet_process"&gt;Dirichlet process&lt;/a&gt;; and the stick-breaking construction is another way to sample from the Dirichlet process. It turns out that there exist counterparts for the Indian buffet process (IBP). The paper "&lt;a href="http://www.cs.berkeley.edu/%7Ethibaux/Projects/AISTATS_07/hbp.pdf"&gt;Hierarchical beta processes and the Indian buffet process&lt;/a&gt;" (AISTATS 07) shows that for IBP the beta process is the counterpart of the Dirichlet process; and the paper "&lt;a href="www.gatsby.ucl.ac.uk/%7Edilan/papers/aistats2007.pdf"&gt;Stick-breaking Construction for the Indian Buffet Process&lt;/a&gt;" (also in AISTATS 07) shows a stick-breaking construction for IBP.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-6022948955873579423?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/6022948955873579423/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=6022948955873579423' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/6022948955873579423'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/6022948955873579423'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2009/04/more-on-indian-buffet-process.html' title='More on Indian buffet process'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-8042259797060660859</id><published>2009-03-25T13:16:00.000-07:00</published><updated>2009-03-26T23:11:42.787-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='unsupervised learning'/><category scheme='http://www.blogger.com/atom/ns#' term='graphical model'/><title type='text'>Indian Buffet Process</title><content type='html'>Nonparametric models don't have a fixed number of effective parameters, but instead have it determined by data. This is very useful if we don't know the parameter number in advance. For example, in &lt;a href="http://en.wikipedia.org/wiki/Gaussian_mixture_model"&gt;Gaussian mixture model&lt;/a&gt; we have to specify &lt;span style="font-style: italic;"&gt;K&lt;/span&gt;, the number of Gaussian distributions (components). If we want to use the model to fit some data but don't know the value of &lt;span style="font-style: italic;"&gt;K&lt;/span&gt;, we have to make a set of models with different values of &lt;span style="font-style: italic;"&gt;K&lt;/span&gt;, and make our choice based on some criteria like &lt;a href="http://en.wikipedia.org/wiki/Akaike_information_criterion"&gt;AIC &lt;/a&gt;or &lt;a href="http://en.wikipedia.org/wiki/Bayesian_information_criterion"&gt;BIC&lt;/a&gt;. With a nonparametric model (more specifically, using &lt;a href="http://en.wikipedia.org/wiki/Dirichlet_process"&gt;Dirichlet process&lt;/a&gt; (DP), or equivalently, &lt;a href="http://en.wikipedia.org/wiki/Chinese_restaurant_process"&gt;Chinese restaurant process&lt;/a&gt; (CRP) as the prior distribution), we assume an infinite number of components, and do a posterior inference which determines how many of them are actually used. The expected number of effective components is controlled by a hyperparameter.&lt;br /&gt;&lt;br /&gt;Because of the popularity of mixture models in AI and machine learning, DP (and CRP) becomes one of the best known nonparametric methods. One limitation of mixture models is that each data point belongs to exactly one component, while in some cases we'd like to associate each data point with multiple labels (or no label at all). For example, an object may have several features, or belong to multiple classes at the same time. To go beyond mixture models, we need a new nonparametric method. The NIPS 06 paper "Infinite latent feature models and the Indian buffet process" by Griffiths and Ghahramani described such a model. Since we can represent the labels of a data point by a binary vector, the labeling of a set of data points can be represented by a matrix. By taking the limit of the distribution over such matrices (actually the equivalence classes of such matrices so that the label order doesn't matter; see their paper) as the number of labels approaches infinity, they get an extension of CRP where each data point can have multiple labels. Just like CRP, they also described a (restaurant-related) stochastic process that samples from the distribution, called Indian buffet process (IBP). Now instead of choosing a table, each customer need to choose buffet dishes, while the number of dishes they can choose is unfixed.&lt;br /&gt;&lt;br /&gt;An interesting example is shown in their paper demonstrating how IBP can be used to generate features by learning from a set of images, each containing zero to four objects (with fixed positions however). By running Gibbs sampling on their model with IBP as the prior, they got seven features, where the four most frequent ones correspond exactly to the four objects.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-8042259797060660859?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/8042259797060660859/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=8042259797060660859' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/8042259797060660859'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/8042259797060660859'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2009/03/indian-buffet-process.html' title='Indian Buffet Process'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-315153002909622200</id><published>2009-03-13T14:51:00.000-07:00</published><updated>2009-03-13T15:38:54.117-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='supervised learning'/><title type='text'>Learning to select features</title><content type='html'>Effective feature selection can be computationally expensive when the feature number is large; and in some cases feature values are expensive to get, so we don't want to examine every feature with the traditional feature selection methods (see &lt;a href="http://www.cs.ualberta.ca/%7Egreiner/BudgetedLearning/"&gt;budgeted learning&lt;/a&gt;). This paper, "&lt;a href="http://jmlr.csail.mit.edu/papers/v9/krupka08b.html"&gt;Learning to Select Features using their Properties&lt;/a&gt;" (JMLR 08), proposes a simple idea to such problems, that they learn to predict good features based on a set of meta-features. Here meta-features are a set of properties of the features. For example, in object recognition people sometimes use the presence of a set of prototype patterns as features, then the meta-features could be the statistics of such prototype patterns. So in their method, they first evaluate only a subset of features using traditional feature selection methods, and then use the result to train a predictor that can be used to select unmeasured features. In their experiments they show their method achieved classification accuracy similar to the traditional methods but was much faster. Unfortunately meta-features have to be specified manually, but nevertheless in many cases it's much more obvious than selecting features.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-315153002909622200?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/315153002909622200/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=315153002909622200' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/315153002909622200'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/315153002909622200'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2009/03/learning-to-select-features.html' title='Learning to select features'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-7208212193104739023</id><published>2009-01-05T20:57:00.000-08:00</published><updated>2009-01-05T22:11:16.911-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='cognitive science'/><title type='text'>Modeling online sentence parsing by particle filter</title><content type='html'>This is an interesting paper from NIPS2008: &lt;a href="http://idiom.ucsd.edu/%7Erlevy/papers/levy_reali_griffiths_2009_nips.pdf"&gt;modeling the effects of memory on human online sentence processing with particle filters&lt;/a&gt;. When we read or listen to a sentence, we receive the words incrementally (i.e., one after another), and construct a mental comprehension of the sentence. Dynamic programming can be used to parse a sentence incrementally, but since it can always find the right parsing, it can't explain why people may fail to comprehend "&lt;a href="http://en.wikipedia.org/wiki/Garden_path_sentence"&gt;garden-path sentences&lt;/a&gt;" in their first attempt. For example, when reading the sentence "the horse raced past the barn fell", we may fail at the last word because we are likely to take "raced" as the main verb before reading the last word "fell". Previous work used pruning to model such an effect: only a set of high-probability partial parsings are kept after receiving each word, so the correct parsing may be dropped halfway. This paper adopts a different idea: in a resource-bounded way, estimating the posterior of partial parsings given the words that have been received. So &lt;a href="http://en.wikipedia.org/wiki/Particle_filter"&gt;particle filter&lt;/a&gt; becomes a natural choice, where each word is an observation and the partial parsings are the hidden states. Again, the right parsing may be dropped halfway because only a set of particles are maintained, which explains the garden-path effect. In addition, this method can explain the "digging-in effect", which says a longer garden-path sentence is harder to comprehend than a shorter one. For example, compare this sentence with the previous one: "the horse raced past the barn that is big and old fell". The explanation is, with more words in the sentence before the disambiguation point, the particles for the right parsing are more likely to be dropped due to resampling.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-7208212193104739023?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/7208212193104739023/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=7208212193104739023' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/7208212193104739023'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/7208212193104739023'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2009/01/modeling-online-sentence-parsing-by.html' title='Modeling online sentence parsing by particle filter'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-8974311579919220460</id><published>2008-12-10T21:51:00.000-08:00</published><updated>2008-12-11T15:53:59.770-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='graphical model'/><title type='text'>Robust constraint-based graphical model structure learning</title><content type='html'>Months ago, I had a &lt;a href="http://ai-ml.blogspot.com/2008/10/graphical-model-structure-learning-with.html"&gt;post&lt;/a&gt; about the work by Bromberg and Margaritis on graphical model structure learning. Here is another paper by them, "&lt;a href="http://www.cs.iastate.edu/%7Edmarg/Papers/Bromberg-Margaritis-IJCAI07-with-header.pdf"&gt;Efficient and Robust Independence-Based Markov Network Structure Discovery&lt;/a&gt;" [IJCAI-07], which seems even more interesting to me.&lt;br /&gt;&lt;br /&gt;The paper has two main contributions. The first is making the constraint-based structure learning more efficient, by always choosing the most informative and inexpensive test to perform. The second is making the structure learning robust to unreliable test results. This second goal is the same as in their other work discussed in my previous post, but here they use a totally different method: they update the posterior of the structures given the tests, instead of simply accepting or rejecting structures as in traditional constraint-based methods. Then there's the problem that there are exponentially many structures and you can't keep track of all of them. So their solution is, &lt;a href="http://en.wikipedia.org/wiki/Particle_filter"&gt;particle filter&lt;/a&gt;!&lt;br /&gt;&lt;br /&gt;Remember there are two types of methods for structure learning: score-based and constraint-based (aka. independence-based). Although this method is still constraint-based (because it's based on independence tests), I feel it also resembles score-based methods. By using posterior instead of hard constraints, the problem resembles an optimization problem instead of a constraint satisfaction problem.&lt;br /&gt;&lt;br /&gt;I'm currently refreshing/supplementing my knowledge on particle filter, and perhaps my next post will be about it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-8974311579919220460?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/8974311579919220460/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=8974311579919220460' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/8974311579919220460'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/8974311579919220460'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2008/12/efficient-and-robust-graphical-model.html' title='Robust constraint-based graphical model structure learning'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-949597755222953386</id><published>2008-11-26T17:49:00.000-08:00</published><updated>2008-11-26T18:26:23.876-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='unsupervised learning'/><title type='text'>Soft clustering with exemplar-based model</title><content type='html'>My &lt;a href="http://ai-ml.blogspot.com/2008/11/clustering-by-affinity-propagation.html"&gt;previous post&lt;/a&gt; discussed affinity propagation, which is a hard clustering algorithm, i.e., it assigns each data point to exactly one cluster. Soft clustering, on the other hand, gives the degree or probability of one data point belonging to each cluster. This could be more flexible than hard clustering in many scenarios. The most common way of doing soft clustering is running &lt;a href="http://en.wikipedia.org/wiki/Expectation-maximization_algorithm"&gt;EM&lt;/a&gt; on a mixture model (typically a Gaussian mixture). The problem is that the objective function (likelihood) may contain many local optima, making the algorithm unlikely to find a good solution in one run.&lt;br /&gt;&lt;br /&gt;Remember that affinity propagation is different from typical clustering in that it requires cluster centers to be data points (exemplars). We can extend this idea to soft-clustering, and doing this greatly reduces the search space. Even better, this paper "&lt;a href="http://people.csail.mit.edu/polina/papers/LashkariGolland_NIPS07.pdf"&gt;Convex Clustering with Exemplar-Based Models&lt;/a&gt;" at NIPS-07 finds that the search space is actually convex. So, the problem becomes a convex optimization problem, and the global optimum can be found. Of course there's a downside of being exemplar-based, that we have to assume there are data points close enough to the real cluster centers, which is not always true.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-949597755222953386?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/949597755222953386/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=949597755222953386' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/949597755222953386'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/949597755222953386'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2008/11/soft-clustering-with-exemplar-based.html' title='Soft clustering with exemplar-based model'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-8868851934828467317</id><published>2008-11-11T20:40:00.000-08:00</published><updated>2008-11-11T20:45:52.497-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='unsupervised learning'/><title type='text'>Clustering by affinity propagation</title><content type='html'>Affinity propagation (AP) is a novel clustering algorithm, first published in NIPS 05 (&lt;a href="http://books.nips.cc/papers/files/nips18/NIPS2005_0799.pdf"&gt;Mixture modeling by Affinity Propagation&lt;/a&gt;) and then in &lt;span style="font-style: italic;"&gt;Science&lt;/span&gt;, Feb 07 (&lt;a href="http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf"&gt;Clustering by Passing Messages Between Data Points&lt;/a&gt;). Slightly different from common clustering setting, it tries to find a subset of the date points as the cluster centers called "exemplars". It iteratively passes two types of messages between each pair of data points. The first type of messages is called &lt;span style="font-style: italic;"&gt;responsibility&lt;/span&gt;, which is how much the source date point prefers the target data point as its exemplar. The other type of messages is called &lt;span style="font-style: italic;"&gt;availability&lt;/span&gt;, which is how much the source data point would like to be the exemplar of the target data point. These two types of messages influence each other (for example, if &lt;span style="font-style: italic;"&gt;b&lt;/span&gt; is more available to &lt;span style="font-style: italic;"&gt;a&lt;/span&gt; than &lt;span style="font-style: italic;"&gt;c&lt;/span&gt; is, then &lt;span style="font-style: italic;"&gt;a&lt;/span&gt; would probably prefer &lt;span style="font-style: italic;"&gt;b&lt;/span&gt; to &lt;span style="font-style: italic;"&gt;c&lt;/span&gt; as its exemplar), so we can update them iteratively until convergence. Then we combine responsibility with availability to determine the clustering. Very intuitive right? A nice feature is that AP can determine the cluster number automatically, based on the a priori setting of how preferably each data point is an exemplar.&lt;br /&gt;&lt;br /&gt;What impressed me is that, such an intuitive algorithm is actually derived from a standard inference algorithm on &lt;a href="http://en.wikipedia.org/wiki/Factor_graph"&gt;factor graphs&lt;/a&gt;. A factor graph can be constructed representing the factorization of the energy function of the clustering as well as the constraint that each cluster center must belong to the cluster. The max-sum algorithm (a message passing algorithm, highly related to the loopy belief propagation algorithm on Bayesian networks) can then be used to find an approximate solution (i.e., clustering) that minimizes the energy function. The original max-sum algorithm has exponential time complexity on this fact graph, but can be simplified to O(n^2), resulting in the AP algorithm.&lt;br /&gt;&lt;br /&gt;Some additional story:&lt;br /&gt;One year after the publication of AP in &lt;span style="font-style: italic;"&gt;Science&lt;/span&gt;, there was a technical comment published in &lt;span style="font-style: italic;"&gt;Science&lt;/span&gt;, Feb 2008. The authors of the comment found that a 40-year-old algorithm called vertex substitution heuristic or VSH (a greedy search algorithm starting from a random initialization) often produces lower error rate than AP with comparable computational time. So they concluded that "...(AP) although a welcome addition to the clustering literature, does not possess the relative advantages over existing procedures that were touted in (the AP paper)." Apparently they were not very happy with AP. Then there was a response from the AP guys, in the same issue of &lt;span style="font-style: italic;"&gt;Science&lt;/span&gt;. They confirmed the experimental results in the comment, but they also tested some much larger data sets. What they found is that, AP and VSH had comparable error rates over all the data sets, but VSH took significantly more time on larger data sets. On the largest data set that they tested (with 18k data points), AP took 2 hours while VSH took 10 days. So AP is actually much more efficient. Still, I think it would be nice to see how AP compares with many other widely used clustering algorithms on problems of different sizes. From one of the authors' homepage, it seems that they've done some more comprehensive tests, but the result is yet to be published.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-8868851934828467317?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/8868851934828467317/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=8868851934828467317' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/8868851934828467317'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/8868851934828467317'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2008/11/clustering-by-affinity-propagation.html' title='Clustering by affinity propagation'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-6713567154683722856</id><published>2008-11-01T15:34:00.000-07:00</published><updated>2008-11-01T15:53:40.792-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='semi-supervised learning'/><title type='text'>Learning from positive and unlabeled data</title><content type='html'>In some machine learning scenario, it's hard or unnatural to get negative training samples. For example, it's easy to find a collection of articles &lt;span style="font-style: italic;"&gt;on&lt;/span&gt; some topic, but it may be hard to find a collection &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; on some topic. Another example is, in research community people tend to publish their positive results but not negative results. This leads to the research of &lt;span style="font-weight: bold;"&gt;one-class classification&lt;/span&gt;, which learns a classifier from positive data only. One such technique is one-class SVM. However, one-class classifiers are usually sensitive to parameters, and how to do parameter tuning on them is still largely an open problem (correct me if I'm wrong). You can't tune them by means like cross-validation because you don't have negative data!&lt;br /&gt;&lt;br /&gt;So semi-supervised learning comes to rescue. Semi-supervised learning tries to learn a classifier from both labeled and unlabeled data, and unlabeled data is usually easy to get. There have been quite some methods on learning from positive and unlabeled data. A recent one is "&lt;a href="http://portal.acm.org/citation.cfm?id=1401920"&gt;Learning Classifiers from Only Positive and Unlabeled Data&lt;/a&gt;" (KDD 08). This paper makes the assumption of "selected completely at random", which says that whether a positive sample is labeled or not is independent of the data sample itself. From that they make some nice derivations and basically turn the problem into a traditional classification problem, which is simpler than many of the previous methods. The experimental results show that this method is as good as the best previous work, biased-SVM (which is also much less efficient due to parameter tuning).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-6713567154683722856?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/6713567154683722856/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=6713567154683722856' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/6713567154683722856'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/6713567154683722856'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2008/11/learning-from-positive-and-unlabeled.html' title='Learning from positive and unlabeled data'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-1245340427401533358</id><published>2008-10-19T13:06:00.000-07:00</published><updated>2008-10-19T13:52:06.687-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='application'/><category scheme='http://www.blogger.com/atom/ns#' term='supervised learning'/><title type='text'>Learn to filter spam</title><content type='html'>Almost everyday we receive lots of spam emails. Training classifiers to filter spam is one of the most successful real-life applications of machine learning. But spammers are getting smart to counteract spam filters! One trick they use is good word attack. By inserting a set of words that often appear in legitimate messages but not in spams, they make spams look legitimate and confuse the filters (increasing false negative). Even worse, when end users label such emails as spams, adaptive spam filters will learn from these new training samples, and may associate those good words with spam (increasing false positive). These days we do sometimes receive such emails, don't we?&lt;br /&gt;&lt;br /&gt;So this paper, "&lt;a href="http://jmlr.csail.mit.edu/papers/v9/jorgensen08a.html"&gt;A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters&lt;/a&gt;" (JMLR Jun 08), proposed to use multi-instance learning for spam filters. &lt;span style="font-weight: bold;"&gt;Multi-instance learning&lt;/span&gt; learns from and makes predictions on bags of instances, instead of individual instances. If at least one of the instances in a bag is positive, then the bag is positive; otherwise, the bag is negative. Let a bag be an email, and an instance be a part of the email. You see this is a perfect match for good word attack. Now I'm wondering what spammers will do next....&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-1245340427401533358?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/1245340427401533358/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=1245340427401533358' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/1245340427401533358'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/1245340427401533358'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2008/10/learn-to-filter-spam.html' title='Learn to filter spam'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-3717569387835054767</id><published>2008-10-12T21:50:00.000-07:00</published><updated>2008-10-12T21:50:00.437-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='graphical model'/><title type='text'>Feature selection and graphical model structure learning</title><content type='html'>Feature selection is a technique used in machine learning (classification or regression) to select a subset of the input features so that the problem size is decreased while the learning performance is not compromised. Graphical model structure learning is to learn the graph structure that best represents the independency relations among the random variables of the target probabilistic distribution. At first glance, the two don't seem to be related. But think about it, feature selection tries to find the minimal set of variables that are sufficient to predict the target variable, which is the Markov blanket of the target variable; and finding the Markov blanket is obviously helpful in learning the graphical model structure (and the reverse is also true). This is exactly what this paper "&lt;a href="http://jmlr.csail.mit.edu/papers/v9/pellet08a.html"&gt;Using Markov Blankets for Causal Structure Learning&lt;/a&gt;" in a recent issue of JMLR does.&lt;br /&gt;&lt;br /&gt;Btw, the first two posts of this blog are both about learning graphical model structure, but... that's mere coincidence! My current research is only remotely related to this area. I'll talk about some other topic in the next post.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-3717569387835054767?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/3717569387835054767/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=3717569387835054767' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/3717569387835054767'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/3717569387835054767'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2008/10/feature-selection-and-graphical-model.html' title='Feature selection and graphical model structure learning'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-1790781823839325776</id><published>2008-10-02T15:00:00.000-07:00</published><updated>2008-10-04T15:43:52.547-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='graphical model'/><title type='text'>Graphical model structure learning with unreliable tests</title><content type='html'>Just listened to an interesting talk by &lt;a href="http://www.cs.iastate.edu/%7Edmarg/"&gt;Dimitris Margaritis&lt;/a&gt; on his work (with Facundo Bromberg) "Improving the Reliability of Causal Discovery from Small Data Sets using Argumentation", which will be published in JMLR. It's a new constraint based method of learning graphical model structures, in the presence of inaccurate independence tests.&lt;br /&gt;&lt;br /&gt;Learning  graphical model structures is an important but difficult problem in machine learning, in which one has to learn the structure of the graphical model from a set of data generated by the model. There are two types of methods for that, score-based and constraint-based. The former searches for a structure (usually by greedy search) that maximizes some score, e.g. likelihood or posterior. The latter does independence tests among variables, takes the results as constraints, and solve a constraint satisfaction problem.&lt;br /&gt;&lt;br /&gt;Traditionally, in the constraint-based method, each independence test removes those candidate structures inconsistent with the test result (according to a set of axioms). This actually assumes the tests to be reliable, which however is mostly untrue in practice. So what can be done when facing unreliable independence  tests? This work presented by Margaritis proposes to do multiple tests and find out all the contradictions among the results. There are also a reliability score associated with each test (e.g., p-value). So now we can try to accept and reject testing results to make them consistent as well as optimize the overall reliability. Margaritis actually uses a simple, somewhat greedy method which, in my understanding, may not optimize the reliability, but the experimental results are still better than the traditional methods that don't consider the reliability issue.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-1790781823839325776?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/1790781823839325776/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=1790781823839325776' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/1790781823839325776'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/1790781823839325776'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2008/10/graphical-model-structure-learning-with.html' title='Graphical model structure learning with unreliable tests'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-2011464264584834981</id><published>2008-09-16T00:00:00.000-07:00</published><updated>2008-10-20T16:29:51.159-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='this blog'/><title type='text'>Guestbook</title><content type='html'>&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-2011464264584834981?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/2011464264584834981/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=2011464264584834981' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/2011464264584834981'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/2011464264584834981'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2008/09/guestbook.html' title='Guestbook'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6054902935917765633.post-3293121053250842560</id><published>2008-09-15T07:48:00.000-07:00</published><updated>2008-10-04T15:23:24.925-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='this blog'/><title type='text'>About this blog</title><content type='html'>So I'm starting this blog, mainly to share and record what I learn (from conferences, journals, textbooks, etc.) in artificial intelligence (AI), machine learning (ML), natural language processing (NLP), knowledge representation and reasoning (KR), cognitive science (CogSci), and other related fields. Among these areas, I will probably focus more on machine learning, which is my current research area. Another motivation of starting this blog is to make something that gives me some additional push to keep working while at the same time adds some fun.&lt;br /&gt;&lt;br /&gt;I'll try to write a new post every one or two weeks, unless I'm too busy in order to meet some hard deadline. If you don't find any update for over a month, chances are I'm again being lazy, so please then kick me some comments to remind me :)&lt;br /&gt;&lt;br /&gt;Comments on my posts are more than welcomed, as I believe discussion always leads to better understanding. Also since I'm not a native English speaker, please forgive me in case you find some "Engrish" in my posts, and I'd appreciate it if you can point out any mistake or vagueness.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6054902935917765633-3293121053250842560?l=ai-ml.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://ai-ml.blogspot.com/feeds/3293121053250842560/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=6054902935917765633&amp;postID=3293121053250842560' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/3293121053250842560'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6054902935917765633/posts/default/3293121053250842560'/><link rel='alternate' type='text/html' href='http://ai-ml.blogspot.com/2008/09/about-this-blog.html' title='About this blog'/><author><name>TU</name><uri>http://www.blogger.com/profile/14195087176810390172</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
