Three-way data reduction based on essential information - Monday webinar

Summary

The Monday webinar hosted by Chemometrics & Machine Learning in Copenhagen featured Rafael Vital from the University of Lille, France. He discussed new methods to compress three-way data using essential information. The presentation revolved around a novel algorithmic procedure for reducing the complexity of data sets, which is crucial for the field of chemometrics. Vital introduced techniques including principal component analysis and archetype identification, demonstrating the efficiency of this method in preserving essential data while achieving significant computational savings. He provided examples with fluorescence spectral data and hyperspectral images, illustrating the potential for this approach in real-world applications.
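
The archetype identification step mentioned in the summary can be sketched for the two-way case as follows. This is a minimal illustration under stated assumptions, not the speaker's implementation: the simulated Gaussian "spectra", the helper names `convex_hull_indices` and `essential_rows`, and the restriction to three components (so that the normalized scores are two-dimensional) are all choices made here for illustration.

```python
import numpy as np


def convex_hull_indices(points):
    """Indices of the 2-D convex hull vertices (Andrew's monotone chain)."""
    order = sorted(range(len(points)), key=lambda i: (points[i][0], points[i][1]))

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a counter-clockwise turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(sequence):
        chain = []
        for i in sequence:
            while len(chain) >= 2 and cross(points[chain[-2]], points[chain[-1]], points[i]) <= 0:
                chain.pop()
            chain.append(i)
        return chain

    lower = half_hull(order)
    upper = half_hull(order[::-1])
    return sorted(set(lower[:-1] + upper[:-1]))


def essential_rows(D):
    """Rows of D on the convex hull of the normalized 3-component PCA scores."""
    U, s, _ = np.linalg.svd(D, full_matrices=False)
    scores = U[:, :3] * s[:3]                   # uncentered PCA scores (assumption)
    normalized = scores[:, 1:] / scores[:, [0]]  # divide by the first-component scores
    return convex_hull_indices(normalized.tolist())


# Simulated data (hypothetical): 3 pure samples followed by 60 ternary mixtures.
rng = np.random.default_rng(0)
axis = np.linspace(0.0, 1.0, 120)
S = np.stack([np.exp(-0.5 * ((axis - c) / 0.12) ** 2) for c in (0.3, 0.5, 0.7)])
C = np.vstack([np.eye(3), rng.dirichlet(np.ones(3), size=60)])
D = C @ S

selected = essential_rows(D)
print(selected)  # indices of the rows retained as essential spectra
```

In this idealized, noise-free setting the pure samples (rows 0 to 2) sit at the vertices of the triangle, so they are among the selected rows; with real data the hull vertices need not be pure spectra, as the talk stresses.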

Highlights

Rafael Vital shared insights into compressing three-way data by focusing on essential information at a Monday webinar. 📅
Vital's method uses a trilinear decomposition, higher-order singular value decomposition (HOSVD), as its dimensionality reduction step. 📉
The approach allows for significant data compression without losing essential qualitative information. 👍
Vital showed how the approach applies to real-world examples, such as time-resolved fluorescence data and a hyperspectral image of a pharmaceutical powder. 💡
During the Q&A, discussions touched upon the potential adjustments and applications of this methodology to various complex data sets. 🗣️

Key Takeaways

Rafael Vital presented innovative ways to compress three-way data using essential information. 🎓
The approach extends the essential-information idea from bilinear (PCA-based) selection to three-way data by replacing PCA with a trilinear decomposition. 📊
The method can bring large computational savings: in one example, PARAFAC-ALS on the reduced data was about 800 times faster than on the full data. 🚀
This data reduction technique retains meaningful information while compressing the data set. 🔑
Vital's discussion included both theoretical insights and practical applications, such as analyzing pharmaceutical samples. 💊

Overview

Rafael Vital from the University of Lille delivered an engaging talk on data compression techniques at a recent Monday webinar organized by Chemometrics & Machine Learning in Copenhagen. His work focuses on simplifying three-way data to make it more manageable and insightful for chemometricians. The session kicked off with Vital introducing a procedure for compressing data using essential information, moving beyond conventional bilinear methods.

Throughout the presentation, Vital went into trilinear data compression techniques, particularly highlighting a novel algorithmic approach. He discussed how this strategy reduces the computational load while preserving essential information: in the examples shown, the analysis ran about 3 times faster (MCR-ALS on a hyperspectral image) and about 30 to 800 times faster (PARAFAC-ALS on fluorescence data). Examples of the method's application included a hyperspectral image of a pharmaceutical powder and time-resolved fluorescence data, showcasing its applicability and versatility.
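
The trilinear dimensionality reduction step described in the talk (a standard SVD on the row-wise, column-wise and tube-wise unfoldings of the array, followed by truncation and normalization of the scores) can be sketched as below. This is an illustrative sketch under assumptions, not the published algorithm: the simulated array, the function names `unfold`, `hosvd_scores` and `normalize_scores`, and the choice to scale the singular vectors by their singular values are all made here for illustration. The final convex hull step is omitted; a general-purpose routine such as `scipy.spatial.ConvexHull` could be applied to each normalized score cloud.

```python
import numpy as np


def unfold(X, mode):
    """Matricize a three-way array so that `mode` indexes the rows."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)


def hosvd_scores(X, n_components):
    """Truncated scores for every mode, from an SVD of each unfolding."""
    scores = []
    for mode in range(X.ndim):
        U, s, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        scores.append(U[:, :n_components] * s[:n_components])
    return scores


def normalize_scores(T):
    """Divide all score vectors element-wise by the first one, then drop it."""
    return T[:, 1:] / T[:, [0]]


# Simulated trilinear array (hypothetical sizes): 9 samples x 50 times x 8 channels.
rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, size=(9, 3))
B = rng.uniform(0.1, 1.0, size=(50, 3))
G = rng.uniform(0.1, 1.0, size=(8, 3))
X = np.einsum("ir,jr,kr->ijk", A, B, G)

normalized = [normalize_scores(T) for T in hosvd_scores(X, n_components=3)]
for name, T in zip(("rows", "columns", "tubes"), normalized):
    print(name, T.shape)
```

Essential rows, columns and tubes would then be the points on the convex hull of each of the three normalized score clouds, and only the sub-array they define would be passed on to the trilinear factorization.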

The webinar concluded with an interactive Q&A session where attendees explored with Vital how this approach might fit within their own work contexts. The discussions were lively, shedding light on practical concerns such as handling retention time shifts in GC-MS data and considerations for higher complexity samples. Overall, Vital's comprehensive breakdown was well-received, contributing valuable insights into modern data analysis practices.

Chapters

00:00 - 01:00: Introduction and Welcome The chapter titled 'Introduction and Welcome' begins with Rasmus from the University of Copenhagen introducing Rafael Vital from the University of Lille in France. Rafael is set to discuss innovative methods of compressing three-way data. Rasmus informs the audience that there will be time for questions after the presentation and encourages them to write questions in the chat if preferred. Rafael then expresses gratitude to Rasmus and the audience before commencing his talk.
01:00 - 05:00: Data Compression Overview The chapter titled "Data Compression Overview" begins with the speaker expressing gratitude for being invited to speak at the Monday webinars. The speaker introduces himself as Rafael Vital, an associate professor at the University of Lille in France, and mentions the opportunity to share the latest advances in his group's research activities. The introduction serves as a preamble to discussing data compression.
05:00 - 10:00: Spectroscopic Data Analysis In this chapter titled 'Spectroscopic Data Analysis,' the focus is on a specialized lab known for its development and application of spectroscopic and microscopic techniques aimed at characterizing complex life science systems. The speaker introduces a novel algorithmic procedure developed in collaboration with colleagues from Iran and the Netherlands. This procedure aims to reduce and compress datasets that exhibit three-way structures.
10:00 - 15:00: Simplex Geometry and Archetypes The chapter titled 'Simplex Geometry and Archetypes' begins by introducing the concept of essential information, an important element in the development of a novel algorithmic procedure. The speaker pauses to provide some context on where the concept came from and on the practical scenarios in which it is particularly beneficial for practitioners of chemometrics. The chapter emphasizes the significance of understanding the origins and applications of essential information in order to use it effectively.
15:00 - 20:00: Archetype Identification in Mixture Data Sets This chapter focuses on simplifying the complexity of data analysis by transitioning from three-way arrays to two-way data analysis, specifically within the context of bilinear curve resolution in spectroscopic data. It highlights the common practice in spectroscopy of using analytical platforms to obtain signal profiles known as spectra.
20:00 - 25:00: Spectral Mixing and Curve Resolution This chapter explores how light interacts with samples under investigation, likening the resultant spectrum to a fingerprint of the specimen being studied. It notes that because spectra often result from the linear combination of signals from pure components, the process of spectral mixing and curve resolution becomes critical for accurate analysis. Understanding spectral signatures is essential for identifying and separating the contributions of individual components within a complex mixture.
25:00 - 30:00: Three-way Data Compression The chapter titled 'Three-way Data Compression' discusses how a spectrum arises as a linear combination of contributions from the pure components of a system (atoms of the same element, compounds sharing a common molecular backbone), in line with the Beer-Lambert law from analytical chemistry. It emphasizes that these individual contributions, along with their respective mixture coefficients, encode information about abundance or concentration.
30:00 - 40:00: Practical Examples of Three-way Data Compression This chapter delves into the complexities of three-way data compression by providing practical examples. It begins by highlighting the challenge of understanding the intrinsic composition of systems when the underlying parameters are unknown and have to be estimated from available observations. The chapter introduces the concept of blind source separation, also called blind unmixing, in which the individual component contributions must be separated and identified without prior knowledge of them.
40:00 - 45:00: Conclusion and Future Work In the concluding chapter, the discussion focuses on the process of analyzing spectral profiles. Resolving a mixture requires variation in the mixture coefficients, so spectral profiles cannot be examined individually. Instead, multiple profiles from different spectral mixtures are collected and organized along the rows of a data matrix, denoted here as D. This data matrix is then subject to bilinear factorization, breaking it down into two distinct matrices, S and C, holding the pure spectral fingerprints and the concentration profiles respectively. This method allows for a structured and efficient analysis of spectral data.
45:00 - 55:00: Questions and Discussion The chapter titled 'Questions and Discussion' provides insights into the spectroscopic analysis and the modeling of component profiles within various samples. It delves into statistical strategies used for successfully resolving and unmixing complex data to understand the concentration and abundance of different sample components.
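
For reference, the bilinear mixture model that the chapters above describe can be written compactly. The notation is an assumption made here (the data matrix written as D, with N spectra, J spectral variables and K pure components); the talk itself only names the matrices S and C.

```latex
% Bilinear mixture model (symbols are assumptions made for this note)
\mathbf{D} = \mathbf{C}\,\mathbf{S}^{\mathsf{T}} + \mathbf{E},
\qquad \mathbf{D}\in\mathbb{R}^{N\times J},\;
\mathbf{C}\in\mathbb{R}^{N\times K},\;
\mathbf{S}\in\mathbb{R}^{J\times K}

% Row-wise: each measured spectrum is a weighted sum of K pure spectra
\mathbf{d}_i^{\mathsf{T}} = \sum_{k=1}^{K} c_{ik}\,\mathbf{s}_k^{\mathsf{T}} + \mathbf{e}_i^{\mathsf{T}}
```

Here the columns of S hold the pure spectral fingerprints, the columns of C the corresponding concentration or abundance profiles, and E the model residuals.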

Three-way data reduction based on essential information - Monday webinar Transcription

00:00 - 00:30 Rasmus Bro, I'm from the University of Copenhagen, and today we have Rafael Vital from the University of Lille in France, who is going to talk about how to compress three-way data in new fancy ways. So please, Rafael, go ahead, and afterwards there will be time for questions etc., or you can write them in the chat if you want. Very good. So, good afternoon everybody. Thanks a lot, Rasmus, for
00:30 - 01:00 inviting me once again to speak during your Monday webinars. I'm always very happy and honored to be here with you, and especially to have the chance to share the latest advances of our research activities. So welcome to all of you. As Rasmus has already announced, my name is Rafael Vital and I'm currently working as associate professor at the University of Lille in France, more specifically affiliated to the
01:00 - 01:30 lab specialized in the development and application of spectroscopic and microscopic techniques for the characterization of complex life science systems. Today I would like to tell you about the novel algorithmic procedure we have recently devised, in collaboration with our colleagues from Iran and the Netherlands, in an attempt to reduce and compress data sets exhibiting three-way structures, based on the
01:30 - 02:00 principle of what we call essential information. However, before diving into the details of such a novel algorithmic procedure, allow me, only for a few minutes, to take a step back and show you where this concept of essential information initially stemmed from, and why, and especially in which scenarios, it could be particularly useful for practitioners of chemometrics. To do so, though, we need to
02:00 - 02:30 start by slightly reducing the degree of complexity and move, just for a while, from the realm of three-way arrays to the one of two-way data analysis, more in particular to the domain of bilinear curve resolution of spectroscopic data. Now, I'm almost sure that most of you already know that in spectroscopy one often exploits analytical platforms providing signal profiles called spectra,
02:30 - 03:00 which basically give insights into how light interacts with investigated samples. And I'm also sure that most of you also know that a spectrum constitutes a sort of fingerprint of the specimens under study, but that it almost always results from linear combinations of individual signal contributions derived from, let's say, pure components underlying the explored
03:00 - 03:30 systems (atoms of the same elements, compounds sharing a common molecular backbone, etc.), as per one of the most popular laws we know in analytical chemistry, the Beer-Lambert law. In this regard, it is worth noticing that it's actually these individual contributions, together with their respective combination or mixture coefficients, encoding information about the abundance or the concentration of
03:30 - 04:00 their corresponding components within the systems themselves, that permit to obtain an accurate idea of their intrinsic composition. Unfortunately, though, all these parameters are usually not known a priori and need to be somehow estimated from available observations. This defines what we commonly refer to as a blind source separation or a blind unmixing problem, whose resolution requires not only one
04:00 - 04:30 but a series of equations like this, characterized by a certain degree of variation in the aforementioned mixture coefficients. In other words, we cannot directly operate on individual spectral profiles, but we need to gather multiple of them for different spectral mixtures along the rows of a data matrix, say D, and carry out the bilinear factorization of this matrix as the product of two distinct submatrices, S and C, that
04:30 - 05:00 contain the pure spectroscopic fingerprints of the components constituting the investigated samples and their related abundance or concentration profiles over all such samples, plus, of course, some model residuals. Now, although there exists a plethora of diverse multivariate statistical strategies to successfully tackle this task, the very basic principle of this resolution or unmixing
05:00 - 05:30 process can be easily conceived through the following example. Imagine that one has collected a series of spectral data on all possible mixture combinations of three different components. This means that the spectral profiles contained in this data set can be underlain by a single pure component, binary mixtures of two out of three components in different proportions, or
05:30 - 06:00 ternary mixtures of all the three components in different proportions. This is of course a simulated example, and for the sake of simplicity here only three pure component spectra were accounted for, these ones highlighted in red. Now, the ternary mixture problem this data set intrinsically encodes can be graphically visualized in a very simple, intuitive and straightforward way. If I take this data set, in fact, I subject it
06:00 - 06:30 to principal component analysis and I represent the resulting scores, the points associated to the aforementioned pure component spectra only initially fall within the scores point clouds without defining any easily recognizable direction in the PCA subspace. However, under specific normalization constraints (here, for example, I divided element-wise
06:30 - 07:00 all component scores vectors by the first component one), this scores point cloud assumes what in chemistry is also known as a two-dimensional simplex geometry, very similar to the ones we are used to observe when mixture designs of experiments are concerned. In other words, each one of these represented points here is distributed within a triangle whose
07:00 - 07:30 vertices correspond to the so-called pure samples, those containing by definition only one of the three components under study, along whose sides are distributed the binary mixtures of the two components such sides actually connect, and whose internal area contains the ternary mixtures of all the three components considered. Now, given this particular representation, it goes without saying that for any linear unmixing
07:30 - 08:00 approach it would only be needed to somehow identify the vertices of this triangular geometry to solve the resolution problem at hand, and I think that everybody agrees with me on this particular aspect. Well, in unsupervised learning and statistics, the vertices of this triangular geometry
08:00 - 08:30 constitute a specific case of what are also called archetypes. Generally speaking, archetypes correspond to the most linearly dissimilar observations of a multivariate data set, support its multi-dimensional convex hull, and share an essential mathematical property: all the other objects of the data set, indeed, can be expressed as convex linear
08:30 - 09:00 combinations of the archetypes' measurement vectors. Now, based on what we have said, it is pretty clear that in this specific case finding archetypes would readily translate into spectrally unmixing the investigated data. However, contrary to this ideal situation, in which pure selective information is available, in which pure samples were collected, well, life may be slightly
09:00 - 09:30 harsher. Why? Because most, if not all, spectroscopy experiments conducted on real-life systems lead to the generation of extremely redundant data, that is to say that all the registered spectral profiles are normally underlain by linear combinations of the aforementioned pure signal contributions. And this would correspond to a situation like this, in which a partial or incomplete simplex geometry
09:30 - 10:00 is observed across the subspace of the normalized PCA scores yielded by the data decomposition. Of course, here archetype identification will not directly enable the resolution of this data set, for which proper multivariate spectral unmixing approaches would be required. Nevertheless, based on what we said just before, that is to say that all
10:00 - 10:30 recorded profiles can be somehow described as linear combinations of the archetypes' measurement vectors, it goes without saying that the archetypes of any mixture data set, once again corresponding to the points supporting its multi-dimensional convex hull, encode by themselves the purest information available.
10:30 - 11:00 In other words, it is not necessary to assess entire data sets for the sake of linear resolution, as in principle their sole archetypes are already capable of driving the resolution towards the expected results. And this aspect can have a tremendous impact in such a domain, especially considering the fact that archetype selection can yield a dramatic
11:00 - 11:30 reduction of the size of the data to be handled while preventing any significant loss of meaningful or useful information for curve resolution or spectral unmixing. Now, let me stress once again this point here, to avoid falling in a common trap that I'm sure [unclear] knows very well, that is tormenting us since the very beginning of all this story. We are perfectly aware of the fact that
11:30 - 12:00 archetype selection does not always guarantee the resolution of a spectral unmixing problem, since archetypes, as in this particular case, do not necessarily correspond to pure component spectra. What we only claim is that selecting archetypes, or let's say essential spectra, and performing curve resolution solely on the selected archetypes can yield an identical, in principle, data
12:00 - 12:30 factorization as when full data sets are concerned. Now let me give you an example of what is possible to gain in these terms by archetype identification and selection prior to unmixing. Here you have a mid-infrared hyperspectral image of a pharmaceutical powder sample containing a mixture of three different compounds: citric acid, acetylsalicylic acid
12:30 - 13:00 and caffeine. And this is the representation of the reduced number of essential spectra, to be precise 49, selected as shown before, together with their respective pixel locations. Now, when we talk about multivariate curve resolution of hyperspectral images, one should notice that we commonly assimilate them to regular collections of spectra, exactly like those I showed
13:00 - 13:30 you before. Therefore, all that we detailed previously remains valid, with the only difference that the aforementioned abundance or concentration profiles become here spatial maps highlighting how the individual components are distributed over the investigated samples. This being said, in order to compare the spectral unmixing carried out only on the ensemble of
13:30 - 14:00 selected essential spectra and on the entire data set under study, we applied to both of them one of the workhorse chemometric algorithms conceived with this purpose, multivariate curve resolution-alternating least squares, also known as MCR-ALS. As one can easily see, the pure spectroscopic fingerprints extracted in both cases are in perfect agreement and contain all the absorption bands
14:00 - 14:30 expected for the three different chemical constituents considered. Also the so-called component distribution maps yielded by the two distinct decompositions performed are basically identical. Just mind that when running MCR-ALS on the subset of essential spectra, these maps are not readily output by the
14:30 - 15:00 algorithm, but need to be somehow reconstructed through a non-negative least squares projection of the entire data set onto the subspace spanned by the three resolved spectral profiles. Summarizing: same resolution obtained here, but of course analyzing only around two or three percent, as far as I remember, of the analyzed spectra, and of course in around three times less computational
15:00 - 15:30 time. Now, the amazing thing behind this concept of essential information extraction regards the fact that such an approach is not limited to only one of the modes of the data under study, but can in principle be extended, for example, to both rows and columns of a data matrix. This way one could think of achieving even higher data
15:30 - 16:00 compression rates while reducing not only the number of spectra but also the number of spectral variables to be processed. For a joint selection of essential spectra and essential spectral variables, adapting the methodological procedure I showed you before is basically straightforward: it is only required to estimate the convex hulls of the normalized PCA scores point clouds
16:00 - 16:30 of the original data matrix and of its transpose. And if I consider the same example I showed you before, by such a joint selection I can accelerate even further the MCR-ALS analysis of my hyperspectral image, as shown here, obtaining once again virtually indistinguishable spectral unmixing outcomes.
16:30 - 17:00 And now I believe I really don't need to say much about how to modify the selection strategy for handling three-way data sets. Well, actually, I don't agree with myself in this case, because something particularly important must be said: we need, in fact, to replace the preliminary bilinear PCA decomposition step with a trilinear one for dimensionality
17:00 - 17:30 reduction of similar arrays. And luckily, in this case, Mr. Tucker comes to our aid. In the 60s, indeed, he proposed a method called higher-order singular value decomposition, or HOSVD, that can actually be regarded as a sort of multi-way or N-way extension of principal component analysis. Just for the details, HOSVD returns a Tucker3 model
17:30 - 18:00 encompassing all possible extractable components and characterized by fully orthogonal component profiles along all the data modes under study. And operationally speaking, such a model is retrieved by performing standard singular value decomposition on the two-way matrices resulting from the row-wise, the column-wise and the tube-wise unfolding of a generic three-way array. So, once HOSVD is performed,
18:00 - 18:30 it is only needed to somehow take the scores yielded for all the three data modes considered, truncate them and normalize them as I have shown you before, and finally estimate the convex hulls of the three resulting normalized HOSVD scores point clouds to be capable of identifying and extracting essential rows, essential
18:30 - 19:00 columns and essential tubes of a three-way array. And let's now consider whether the advantages deriving from the selection of essential information in bilinear spectral unmixing remain valid even when three-way data sets, and especially trilinear curve resolution approaches, namely parallel
19:00 - 19:30 factor analysis-alternating least squares, or PARAFAC-ALS, are concerned. The first practical example I'm going to show you regards a set of time-resolved fluorescence spectral data we collected in our lab on nine mixtures of three fluorescent dyes: Alexa 647, Alexa 655 and Alexa 665. For every mixture, eight distinct fluorescence decays were
19:30 - 20:00 recorded at eight different emission wavelengths, which resulted in a three-way data set of dimension nine samples times 2,500 time points times eight emission wavelength channels. Here you have a very schematic representation of this data set, in which I displayed, for every mixture sample, an individual fluorescence decay integrated across all
20:00 - 20:30 the spectral channels and an individual spectral profile integrated across all the decay time points. This data set, of course, is assumed to exhibit trilinearity among the aforementioned measurement modes, since also these fluorescence decays follow multiexponential functions that, as we said before for the spectral profiles, are
20:30 - 21:00 the combination of monoexponential contributions, each one related to a given fluorophore under study. Here the approach I showed you in my previous slide enabled the identification and selection of six essential rows, 24 essential columns and only five essential tubes, that together with the original full data set were submitted to PARAFAC-ALS
21:00 - 21:30 for their trilinear factorization. As one can easily see here, all the three-component mode loadings profiles retrieved from the subset of essential data only are virtually indistinguishable with respect to the ones resulting from the full-size
21:30 - 22:00 measurements. However, the computational time required to estimate them was found to be about 800 times lower than the one obtained when such full-size measurements were analyzed. If one also has a closer look, for example, at the essential rows selected (please see the dashed lines in this figure), it is evident how four out of six
22:00 - 22:30 relate to chemical mixtures constituted by only two of the three fluorophores considered, and for which the PARAFAC-ALS loading profile of the missing component assumes values approximately equal to zero. This clearly evidences how the essential information-based selection procedure is actually capable
22:30 - 23:00 of reducing the redundancy of the information encoded in the investigated data prior to their final decomposition. Emission wavelengths characterized by a strong overlap among the three different PARAFAC-ALS loadings profiles were also satisfactorily identified as non-essential and therefore excluded from the compression operations. The second example concerns instead a
23:00 - 23:30 246 × 238 sequential illumination fluorescence image of bone tissue cells affected by osteosarcoma and stained with a combination of two different fluorophores, DiBAC4(3) and concanavalin A-Alexa 488. The pixels of such an image are underlain by six different fluorescence decays, that here, for the sake of simplicity, I represented all together
23:30 - 24:00 in the same plot, recorded over 45 consecutive time instants across a series of six photobleaching-photorecovery cycles. During these cycles, the laser power was modulated differently in an attempt to emphasize the distinctive dynamic behavior of the various fluorescent species bound to the individual compartments of the
24:00 - 24:30 system under study. Also here, these decays follow multiexponential functions, as the ones I showed you previously, that are the combination of monoexponential contributions, each one related to a fluorophore located in a particular chemical environment. Clearly, before PARAFAC-ALS processing, this image was preliminarily unfolded pixel-wise, which resulted in a
24:30 - 25:00 three-dimensional data structure of size 58,548 pixels times six photobleaching-photorecovery cycles times 45 time points, assumed once again to exhibit trilinearity among the pixel, the cycle and the time modes. As one can easily see, also in this case the essential information-based reduction procedure yielded a
25:00 - 25:30 significant compression of the data at hand, which permitted to attain, in a substantially lower amount of time, namely approximately 30 times faster, a trilinear factor model that is basically indistinguishable from the one resulting from the decomposition of the original measurements collected. Please notice that in this case the
25:30 - 26:00 number of factor profiles was set to three based on preliminary knowledge available on the number of components expected to be observed in the captured scene. Also in this case, the essential rows, the essential columns and the essential tubes selected correspond to rows, columns and tubes in correspondence of which distinctive cellular compartments can be found, the overlap
26:00 - 26:30 among the PARAFAC-ALS loadings profiles is minimal, or their trend is not particularly correlated, which demonstrates once again that the proposed approach is capable of significantly reducing the redundancy of the information encoded in the investigated data. And here I want to conclude, summarizing the fact that we proposed in this work an approach for the reduction of three-way data arrays based on
26:30 - 27:00 the principle of higher-order singular value decomposition, that basically extends previous works to the domain of three-way data compression. This approach is capable of retaining the most meaningful information required for a sensible trilinear factorization of such three-way arrays, which is the main difference between this technique and other already proposed
27:00 - 27:30 strategies, often relying on the principle of random sampling. The PARAFAC-ALS decomposition of the reduced data obtained this way yields, in principle, virtually indistinguishable loadings profiles with respect to the situations in which full data sets are analyzed, and can be up to 800 times faster than when PARAFAC-ALS is run on full-size
27:30 - 28:00 data. And of course, PARAFAC-ALS model quality and adequacy are preserved even after this compression procedure is carried out. Rasmus, I still have a couple of seconds of the presentation, just to ask a question to you and to all the invited people that are here today. At
28:00 - 28:30 the moment, the higher-order singular value decomposition-based approach we proposed basically accommodates almost, I would say, perfectly the structure of a PARAFAC-ALS model, because basically we extracted the same number of factors for all the data modes under study. And my question was the following: would it make sense to somehow modulate the compression, trying to extract a different number of factors for
28:30 - 29:00 individual modes, to somehow accommodate better the inherent structure of alternative trilinear factorization models, like for instance Tucker variants, PARALIND or similar? And I take some time as well for a bit of advertisement, Rasmus, if I can, just to advertise two events that I am directly involved in the organization of: the next Colloquium
29:00 - 29:30 Chemiometricum Mediterraneum, that this year is organized in France by us and that will take place on the island of Porquerolles in the south of France in September 2025, as well as a closer, I would say, event that will take place in Coimbra in Portugal in May. That's an ENBIS spring meeting. ENBIS is the European network of industrial and business statistics, and, well, this event is mainly devoted to young
29:30 - 30:00 scientists willing to share with us experience, this year on quality by design and process analytical technology, and more specifically on the application of purely data-driven, say statistical or AI, approaches and gray approaches, let's say mixing first-principles modeling with totally data-driven modeling. And with this I thank you all for the attention. Thank you so much, I thought that was beautiful. I have some questions and comments, but before, I
30:00 - 30:30 would like to ask people uh if if there are any questions comments Etc um and you can write it in the chat or if your microphone is good you can uh just ask the question uh hi Rafa this is Paul yeah um I I was just curious to of the um times
30:30 - 31:00 the computation times does that also account for the time you need to find the essential information yes yes okay so uh this is basically a comment that we received after the first round of review of of our paper basically um the the computational time that I indicated every time I compare the paraa alss run on the full data and the paraa l s I run on the reduced data takes into account
31:00 - 31:30 the selection, so the identification of the essential rows, variables and columns, the decomposition of the reduced data, as well as the least squares projection of the full data to reconstruct the full-size loading profiles, let's say. Okay, well, that's really impressive then. And yeah, thanks for the talk.
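The last step mentioned in this answer, the least squares projection of the full data to recover full-size loadings, can be illustrated with a minimal sketch. It is not the speaker's code: the function names `khatri_rao` and `project_first_mode` are hypothetical, the data are synthetic and noise-free, and the loadings `B` and `C` are simply assumed to be already available (in the workflow described above they would come from the decomposition of the reduced data).

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product of C (K x R) and B (J x R), shape (K*J x R)."""
    K, R = C.shape
    J = B.shape[0]
    return (C[:, None, :] * B[None, :, :]).reshape(K * J, R)

def project_first_mode(X, B, C):
    """Least squares estimate of full-size first-mode loadings A,
    given the full array X (I x J x K) and loadings B, C of the other two modes."""
    I, J, K = X.shape
    # Unfold so that the column index runs as k * J + j, matching khatri_rao(C, B).
    X1 = X.transpose(0, 2, 1).reshape(I, K * J)
    Z = khatri_rao(C, B)
    # Solve Z @ A.T = X1.T in the least squares sense.
    A_T, *_ = np.linalg.lstsq(Z, X1.T, rcond=None)
    return A_T.T

# Synthetic trilinear data (assumption: three components, no noise).
rng = np.random.default_rng(0)
I, J, K, R = 50, 20, 15, 3
A_true = rng.random((I, R))
B = rng.random((J, R))
C = rng.random((K, R))
X = np.einsum("ir,jr,kr->ijk", A_true, B, C)

A_hat = project_first_mode(X, B, C)
print(np.allclose(A_hat, A_true))
```

The projection is a single linear solve, which is consistent with it adding little to the overall computation time compared with an iterative decomposition of the full array.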
31:30 - 32:00 It's really interesting. Thanks a lot, Paul, thanks. And the paper has just been put in the chat, so you can check it there. And let's see, did we have a question down here? Yes, there's a question: I'm working on GC, GC-MS, and I would like to know how this technique can be applied to this type of data, with the issues of retention time shifts. Yes, that's true, in the sense that, well, this is, I would
32:00 - 32:30 say, a more classical problem, rather than one only related to the compression, because in a certain sense one should guarantee the trilinearity or the multilinearity of the data sets for a further decomposition. I don't have a direct solution for that, unless, because we were discussing with Rasmus a couple of times before this meeting, and
32:30 - 33:00 unless this proposal, let's say, of trying to extract a different number of components per mode might help in this case. Because in my opinion, at least as far as the algorithm is coded and the approach is proposed, we are not able to automatically handle peak shifts or band shifts, because this anyway constitutes a breakage of the
33:00 - 33:30 linearity, so it goes a bit beyond the rationale of the algorithm itself and it enters more the most fundamental question of multilinearity properties, let's say. I'm not sure whether this kind of approach, once again, let's say selecting a different number of factors per mode, might help in this kind of situation. Yeah, sorry Osman, I will just add, I
33:30 - 34:00 agree with you, and I think then, for example, if you do PARAFAC2 to handle that, then you can just handle that; there's an additional complexity, the rank in that mode is higher, so you extract more. Exactly, exactly, and in fact this goes along with this kind of question, in the sense: can we modulate to somehow also accommodate structures like PARAFAC2 models? Anyway, I was now thinking, and probably an idea popped up in my mind:
34:00 - 34:30 if along the retention time mode you have shifts, why not unfold on that mode and try to apply essential information selection on the resulting two-way matrix? That's something that can always be done as a, let's say, ready-to-use solution that one can try.
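A minimal sketch of the unfolding idea suggested here, under several assumptions that are not in the talk: the array is arranged as samples x retention times x spectral channels, unfolding is read as stacking all (sample, time) rows into one two-way matrix, PCA is done by an uncentered SVD, and "essential" rows are simply taken as the convex hull vertices of the raw score cloud (the published procedure may normalize or preprocess differently). The helper `essential_rows` and the synthetic shifted elution profiles are hypothetical.

```python
import numpy as np
from scipy.spatial import ConvexHull

def essential_rows(M, n_components):
    """Indices of rows of M lying on the convex hull of the PCA score cloud."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    return np.sort(ConvexHull(scores).vertices)

# Synthetic example: two components whose elution peaks shift from sample to sample.
rng = np.random.default_rng(1)
n_samples, n_times, n_channels = 4, 100, 30
t = np.arange(n_times)
S = rng.random((2, n_channels))              # two hypothetical pure spectra
X = np.empty((n_samples, n_times, n_channels))
for i in range(n_samples):
    shift = 3 * i                            # retention time shift of sample i
    c1 = np.exp(-0.5 * ((t - 40 - shift) / 5.0) ** 2)
    c2 = np.exp(-0.5 * ((t - 55 - shift) / 5.0) ** 2)
    X[i] = np.column_stack([c1, c2]) @ S
X += 1e-3 * rng.standard_normal(X.shape)     # small noise

# Unfold: every (sample, time) pair becomes one row of a two-way matrix.
M = X.reshape(n_samples * n_times, n_channels)
rows = essential_rows(M, n_components=2)

# Map the selected rows back to their sample and retention time indices.
sample_idx, time_idx = np.unravel_index(rows, (n_samples, n_times))
print(rows.size, "of", M.shape[0], "rows selected")
```

Because the unfolded matrix stays bilinear even when the elution profiles shift between samples, the row selection itself does not rely on trilinearity in this arrangement.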
34:30 - 35:00 Yeah, and that goes with what C said, that it's variable selection, he wrote in the chat. Yes, it's not the subsequent aspect, rather how to just represent your data. Yes, yeah. Okay, sorry, I think, Osman, now is your time, I kept you away. Thanks, thanks, Rafael, for the presentation. I can see that the models that you play with are rather simple; in my mind, with the complex
35:00 - 35:30 samples that we are looking at, how does it...? And we have just experienced that using, like, a general least squares procedure or, like, a general trilinear decomposition does not give the global minimum, ever. Have you played, do you have any...? So, to be honest, I have not played with really complex samples, let's say,
35:30 - 36:00 in the three-way domain, let's say, but we have analyzed systems characterized by eight, nine components, namely pharmaceutical systems characterized by eight, nine components, by, let's say, two-way approaches, so standard MCR-ALS, and we had rather surprising results also in that case. The only thing that
36:00 - 36:30 you will pay, let's say the trade-off, is the number of essential spectra or essential spectral variables that you will select, because you will need somehow to increase the number of components along which you perform the selection to catch all the sources of variability underlying your system. So this will mean that you can always increase the number of components; of course you will pay the price of having
36:30 - 37:00 to analyze a larger number of samples, but sometimes we had surprising results also in this particular scenario when dealing with minor components, that is to say, among the seven, eight, nine components you have sometimes one component that only appears in a few pixels. I agree with you, in that case least squares based approaches may not handle this kind of minor components, because somehow they will adapt
37:00 - 37:30 the models to the major components. Using this kind of trick of reducing the size of the data, it seems, at least in many cases we handled, that you can easily approach this global minimum, at least in two-way data analysis. I'm not sure if in three-way that's the case, because, let's say, we have never collected data on systems
37:30 - 38:00 more complex than, I would say, four, five components. Actually, when dealing with this kind of fluorescence techniques, it is already particularly complicated to get images or data underlain by three or four fluorophores; then thinking of increasing the number of fluorophores, or chemical species that are spectroscopically active, for us is a bit complicated, let's
38:00 - 38:30 say. Thanks, thanks, Osman. Okay, can I ask you: I expect that if your initial PCA, or whatever, is too small, I mean too few components, that's going to be an issue. Yes, that's totally correct. It's more an underestimation of the complexity that is dangerous there than an overestimation. Yeah, if your rank,
38:30 - 39:00 let's say, is underestimated, then you lose a component, and then your selection, your reduction, let's say, will not take into account the influence that that component had in the original data. Of course, I always say that this problem is important, but it's not that important, because you can always stay a little bit more conservative and say: okay, I go on with the number of components, I overestimate my rank. The price to pay is simply that I will not reduce the
39:00 - 39:30 data as much as in the ideal case, but in that case you will be sure that all components contain, sorry, yes, contain information related to all the chemical components of your system. This is something we commonly play with, and this is actually the main difference you have with these random sampling approaches. Random sampling, yes, of course, if your component is distributed randomly or homogeneously over, for example, a
39:30 - 40:00 hyperspectral image, everything is fine, but if you have minor components, then random sampling might not be the ideal solution, let's say. Yeah, but how much? I would kind of guess that when you increase the rank, it kind of explodes a little bit, the number, I mean the number of members of the convex hull. It increases, it increases. It does not explode, I mean, you cannot select more than your original data set, so if
40:00 - 40:30 you can handle, in a certain sense, your original data set, you might be able to handle also, let's say... But it doesn't increase... Well, that's something I've never tested. If you mean whether the trend of increase is exponential or linear, that I don't know, to be honest. No, okay, I would just expect it to be more than linear, but okay. Okay, I don't know, maybe Paul has an input. Paul? Yeah, no, I don't really have
40:30 - 41:00 an input for that. I had another question, maybe related to this: have you practically had any experience with sequential modeling? Like, if you start with three principal components and then you see, yeah, the information that I received is nice, but then you continue modeling the data set with, let's say, the four to
41:00 - 41:30 six principal components that you have not looked into, because I think actually you can kind of nicely unpeel the data, maybe, in that way, because it's all orthogonal, right? So regarding this, in our case it's much more interesting, rather than playing with components (of course this is also nice), you can also implement what is called convex peeling within the selection, that
41:30 - 42:00 is to say, instead of selecting all the samples belonging to the first convex hull, what you can do is you select the first convex hull, you remove it, then you continue selecting the convex hull, so in this case you will reach, in a certain sense, gradually the most internal area of the simplex. So you will anyway select the, let's say, purest samples, columns or tubes available; just you
42:00 - 42:30 would be a bit more robust, because of course in a real case study you might have outliers, you might have noise, that somehow generate observations that cannot be described as linear combinations of pure component spectra, so you will definitely select them. But if you play with the sequential convex peeling, you can robustify the whole procedure in that sense. But this is, let's say, something that we
42:30 - 43:00 did previously in bilinear curve resolution: we weighted, let's say, the resolution based on this kind of convex peeling approach, and this somehow robustifies the estimates. Robustifying the estimates means that your estimates become less uncertain, because of course if you reduce the data too much, well, noise will take somehow a big
43:00 - 43:30 influence with respect to the structure, and of course it can affect, can perturb, the uncertainty of your estimates. But if you play with this kind of convex peeling, you can robustify the whole procedure in this sense. This is the only sequential modeling strategy that we have explored so far. Of course, sometimes, when you don't have a satisfying resolution with two, three, four components, you can always try
43:30 - 44:00 to increase the number of components up until somehow the resolution you obtain makes sense to you or can be interpreted by an expert, etc. This is of course something that we often do when dealing with new data, let's say, about which we don't know much. That's my experience on this point. I don't know if I answered your question, Paul. Yes, yeah.
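The convex peeling described in this answer (select the first convex hull, remove it, select the hull of what remains, and so on) can be sketched as follows. This is only an illustration on random two-dimensional scores: the function name `convex_peeling` is hypothetical, and how the successive layers are then used (kept, discarded as possible outliers, or used to weight a resolution) is left open, since the talk does not spell it out.

```python
import numpy as np
from scipy.spatial import ConvexHull

def convex_peeling(scores, n_layers):
    """Return a list of index arrays, one per successive convex hull layer."""
    remaining = np.arange(scores.shape[0])
    layers = []
    for _ in range(n_layers):
        # A hull in d dimensions needs at least d + 1 points.
        if remaining.size < scores.shape[1] + 1:
            break
        hull = ConvexHull(scores[remaining])
        layer = remaining[hull.vertices]      # map back to original row indices
        layers.append(layer)
        remaining = np.setdiff1d(remaining, layer)
    return layers

# Hypothetical score cloud: 200 observations described by two components.
rng = np.random.default_rng(0)
scores = rng.standard_normal((200, 2))
layers = convex_peeling(scores, n_layers=3)
for depth, layer in enumerate(layers, start=1):
    print("layer", depth, "contains", layer.size, "observations")
```

Each layer lies strictly inside the previous one, so isolated outliers tend to end up in the outermost layers while later layers move toward the interior of the cloud.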
44:00 - 44:30 A practical solution could be: you calculate it on three components, then you subtract the information of the first three principal components from your data, and then just repeat, go on. And I mean, your strategy of this unpeeling is really... You mean about when the number of components increases, that's what you mean? Yeah. Ah, okay, so I have a sensation
44:30 - 45:00 about that, because this is based on some preliminary tests I have done. When you select with three components and you deflate afterwards to select once again on the residual three components, I'm not sure that selecting over six components, compared to selecting first on the first three and then on the second three, is the same. And you know why?
45:00 - 45:30 Because geometrically a, I don't know how to say, a polytope in six dimensions cannot necessarily be described as a combination of two polytopes in three dimensions, even if the dimensions are orthogonal. We have experienced that, in the sense that, we tried that, if you select in a subspace of six principal components and you select on all possible combinations of two
45:30 - 46:00 components among the six, the selection is not the same. Okay, yeah, that answered my question. I was... sorry, I probably... your question, sorry. Yeah, okay, so
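The geometric point made in this exchange can be checked on synthetic data with a small sketch. It compares hull-based selection in a six-component score space against selection on three components, deflation, and selection on the next three. Everything here is an assumption made for illustration: the random rank-six data, the uncentered SVD used as PCA, and the helper names `pca_scores` and `hull_rows`. It does not reproduce the speaker's own tests.

```python
import numpy as np
from scipy.spatial import ConvexHull

def pca_scores(M, n_components):
    """Scores and rank-limited reconstruction from an uncentered SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    model = scores @ Vt[:n_components]
    return scores, model

def hull_rows(scores):
    """Set of row indices that are vertices of the convex hull of the scores."""
    return set(ConvexHull(scores).vertices.tolist())

# Synthetic six-component bilinear data with a little noise.
rng = np.random.default_rng(2)
X = rng.random((120, 6)) @ rng.random((6, 40))
X += 1e-3 * rng.standard_normal(X.shape)

# Selection performed once in the six-dimensional score space.
scores6, _ = pca_scores(X, 6)
sel_joint = hull_rows(scores6)

# Selection on three components, deflation, then selection on the residual.
scores_a, model_a = pca_scores(X, 3)
scores_b, _ = pca_scores(X - model_a, 3)
sel_sequential = hull_rows(scores_a) | hull_rows(scores_b)

print(len(sel_joint), len(sel_sequential), sel_joint == sel_sequential)
```

Comparing the two index sets shows whether, for a given data set, the sequential route recovers the same observations as the joint six-dimensional selection.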