| Paper ID: | 9289 |
|---|---|
| Title: | Generalization in multitask deep neural classifiers: a statistical physics approach |

Overall, the paper is clear and easy to follow. The proofs and main derivations are all in the appendix, and while I did not check them thoroughly, they are relatively straightforward. The idea of analyzing the training dynamics by taking a simple tractable model, making approximations, and deriving both qualitative and quantitative conclusions that can be validated experimentally is quite interesting. The particular approach adopted here to analyze multitask learning is novel.

Originality: The theoretical results on single-task and multitask learning in classification networks are novel, as previous literature focused mainly on regression networks. Although the key analyses are inspired by [16], tackling the cross-entropy loss function in this setting is original.

Quality: The theoretical derivation of the generalization dynamics of both single-task and multitask classification networks, and its comparison to empirical results through well-designed experiments, makes this paper a good contribution. The results showing that deeper nonlinear networks follow similar trends to shallow linear networks make this contribution much stronger.

Clarity: The paper is well written, the theory is clearly described, and the results are presented through descriptive figures.

Significance: The contribution is important for understanding the generalization dynamics of classification networks. The theoretical results and their empirical validation on generalization dynamics in the multitask setting are very relevant in complex setups where multiple auxiliary tasks can aid in learning a new task.

## Overall

The paper presents an interesting analysis of linear classification networks using the student-teacher framework. The experiments on multitask learning are informative. I wish the experiments and theory were a bit more integrated. See my comments below for more details.

## Writing

+ The paper is clearly written. The authors moved a lot of details to the appendix while keeping the main conclusions in the main submission to ease understanding.
- Often, the authors state results without referring to the corresponding equation number in the Appendix. Here are some examples:
  (a) L181-184: which equation shows that (s_A - \tilde{s_A}) depends on the said four things?
  (b) L185-186: when labelled data is scarce, why is (\bar{s_A*g(s_A)} - \tilde{s_A*g(s_A)}) < 0?
  (c) L189-190: why does (\bar{s_A*g(s_A)} - \tilde{s_A*g(s_A)}) tend to 0 when training data is abundant?
  It's not obvious where these results come from or why they are true.
- The key takeaways or results need to be stated more explicitly.
- The title reads "a statistical physics approach", but it's unclear what techniques are borrowed from physics. The authors refer only loosely to a connection to the physics of disordered systems. If this connection is important, please state the result in physics, describing the physical system and variables involved, and then draw analogies to the learning dynamics in the network.

## Empirical Evaluation

+ The experiments on multitask learning are illustrative.
+ Fig. 2 quite cleverly manages to show the relation between 5 different variables.
- Having said that, the figure is quite complicated, and the main conclusions get a bit lost in a wall of plots. My suggestion would be to break Fig. 2 into smaller plots, each considering only the subset of variables necessary to make the point (e.g. by fixing the SNRs to a certain value when trying to see the effect of training data size).
- The multitask experiments seem a bit disconnected from the analysis done earlier. The authors are encouraged to refer to the relevant equations in the experiments and point out how the empirical conclusions match the theory.

============================================================

After reading the other reviews and the rebuttal, I think the paper should be accepted. My main concerns were related to writing, and based on the rebuttal I trust the authors to update the manuscript accordingly. This is the first theoretically motivated analysis of multitask learning that I have come across, and I would love to see more work building on it. For any revised manuscript, my recommendation would be to include the following:

1. Tab. 1 from the rebuttal (also reference the relevant equation in the “analytical explanation” column).
2. An improved explanation of Derrida’s Random Energy Model to make the paper self-contained (most readers would be unfamiliar with it; it is fine to put this in the Appendix).
3. References to the relevant equations in the appendix; also consider adding the key equations to the main manuscript.