I want to start removing my outliers before model training. In the attached screenshot, how do I modify this single point? I clicked in the upper grid thinking I could modify a single cell value by clicking on it, but either that's not an option or I'm doing something wrong. I know I can copy and paste into Excel and make the mods there, but then I have to reimport the file. Is there a consolidated workflow that anyone can recommend?
Also, can anyone suggest a fairly painless way of preprocessing these outliers out of my source data files without having to make cell modifications one by one? I'd love to eventually be able to automate this process if possible.
What do the two pink horizontal lines in the chart represent (the bottom line has been cut off in this screenshot)? They look important :> I just don't see their corresponding values in the properties on the right side (with min, max, avg, etc.). Do they represent some sort of standard deviation for that column?
In addition, where it says POINT, can you explain how the two outlier drop-down options work (Outliers Error Perc and Outliers Abs Error) and what role the value input box plays for these two options?
Thanks
Devon
Hi Devon,
The two pink lines are really important for detecting outliers :) They are the standard deviation lines. The default is 3 sigma, which gives a simple method of detecting outliers: any points outside these lines are treated as outliers, and you should remove them if they are negatively impacting your modeling. Through the context menu you can choose to plot 1, 2 or 3 stdev lines. The yellow line is the average, which you can also switch on and off through the context menu.
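For anyone who wants to see the 3 sigma rule outside the charts, here is a minimal Python sketch. The function names are made up for illustration, and it assumes the population standard deviation; the tool may compute its lines with the sample standard deviation instead, so the bounds could differ slightly.

```python
from statistics import mean, pstdev


def sigma_bounds(values, n_sigma=3.0):
    """Return (lower, upper) at mean -/+ n_sigma standard deviations."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population stdev, not sample stdev
    return mu - n_sigma * sd, mu + n_sigma * sd


def flag_outliers(values, n_sigma=3.0):
    """Return the 0-based positions of values outside the sigma bounds."""
    lower, upper = sigma_bounds(values, n_sigma)
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Twenty ordinary readings and one spike at the end.
column = [10.0] * 20 + [100.0]
print(flag_outliers(column))
```

Passing `n_sigma=1.0` or `n_sigma=2.0` corresponds to plotting the 1 or 2 stdev lines instead of the default.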
Now, to remove the outliers, you don’t have a fully automated process yet, but after you spot them using the charts, you can then delete them through the Delete Records Window (it’s accessible through the Data Menu). So it’s fairly simple and fast.
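Until removal is automated inside the tool, one option is to clean the source file before importing it. The sketch below is only an illustration under stated assumptions: the data is CSV text with a header row, the column name `price` is hypothetical, and the filter is the same 3 sigma rule (population standard deviation) applied to one column.

```python
import csv
import io
from statistics import mean, pstdev


def drop_outlier_rows(csv_text, column, n_sigma=3.0):
    """Return CSV text without rows whose `column` value lies more than
    n_sigma standard deviations away from that column's mean."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    values = [float(r[column]) for r in rows]
    mu, sd = mean(values), pstdev(values)
    kept = [r for r, v in zip(rows, values) if abs(v - mu) <= n_sigma * sd]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)
    return out.getvalue()


# Hypothetical file contents: 20 ordinary rows plus one spike.
raw = "t,price\n" + "".join(f"{i},10.0\n" for i in range(20)) + "20,100.0\n"
cleaned = drop_outlier_rows(raw, "price")
print(cleaned.count("\n") - 1, "data rows kept")
```

To work on real files, read the text with `open(path).read()` and write the returned string back out to a new file, keeping the original untouched.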
<<In addition, where it says POINT, can you explain how the two outlier drop-down options work (Outliers Error Perc and Outliers Abs Error) and what role the value input box plays for these two options?>>
These outliers refer to model output outliers (they are relative to the active model, so you can do interesting analyses with this). In the input box you write the error value you want. There’s also a tooltip over this input box that tells you how many outliers fit that criterion; they are then shown in red in the charts (line charts and scatter plot).
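As a rough illustration of the idea, the sketch below flags records whose model error exceeds a chosen threshold. The error definitions are assumptions, not the tool's documented formulas: absolute error is taken as |target - output| and percentage error as 100 * |target - output| / |target|. The function name is hypothetical.

```python
def error_outliers(targets, outputs, threshold, relative=False):
    """Return 0-based positions where the model error exceeds `threshold`.

    relative=False: absolute error |target - output| (assumed definition).
    relative=True:  percentage error relative to the target (assumed).
    """
    flagged = []
    for i, (t, y) in enumerate(zip(targets, outputs)):
        err = abs(t - y)
        if relative:
            if t == 0:
                continue  # percentage error is undefined here; row skipped
            err = 100.0 * err / abs(t)
        if err > threshold:
            flagged.append(i)
    return flagged


targets = [10, 20, 30, 40]
outputs = [11, 20, 45, 38]
print(error_outliers(targets, outputs, threshold=5))                 # absolute
print(error_outliers(targets, outputs, threshold=8, relative=True))  # percent
```

The length of the returned list plays the role of the count shown in the tooltip, and the positions are the points that would be drawn in red.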
Candida
Thanks Candida!
That is useful information.
I did discover that I can delete rows, but it would seem intuitive to be able to easily and quickly modify a cell value as one would in an Excel file. Question: would deleting rows make any difference when modeling sequential time series data with any of the non time series functions (regression, classification, etc.)? Are these modeling types even "aware" of the previous or next rows within the sequence? If not, I guess deleting a row would have little impact.
Maybe if enough of your users would also find the ability to edit individual cell values useful, we could add it to the future upgrades wish list?
Devon Kyle
That’s already on the wish list. We’ll add support for record editing in v6.
<< Question: would deleting rows make any difference when modeling sequential time series data with any of the non time series functions (regression, classification, etc.)? Are these modeling types even "aware" of the previous or next rows within the sequence? If not, I guess deleting a row would have little impact.>>
For classification and logistic regression, the answer is no. For regression the answer is also no for most fitness functions. But we now have new fitness functions for regression (and time series prediction too, but the deletion of records in TSP is not supported) that exploit the time sequence in simple ways: the Trends series (4 fitness functions) and the Weighted series (9 functions). Even these fitness functions will work fine if you randomize or delete the inputs, as the time component is just one of the terms.
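To make the difference concrete, here is a small sketch contrasting a row-wise error measure with a sequence-aware one. The plain MSE is standard; the recency weighting (row i gets weight i + 1) is a made-up example and not the formula used by the Weighted series, whose actual weights are not described in this thread.

```python
def mse(targets, outputs):
    """Plain mean squared error: every row counts equally."""
    return sum((t - y) ** 2 for t, y in zip(targets, outputs)) / len(targets)


def recency_weighted_mse(targets, outputs):
    """Hypothetical weighting: later rows count more (weights 1, 2, ..., n)."""
    weights = range(1, len(targets) + 1)
    total = sum(w * (t - y) ** 2 for w, t, y in zip(weights, targets, outputs))
    return total / sum(weights)


targets = [1, 2, 3, 4]
outputs = [1, 2, 3, 6]  # the only error is on the last (most recent) row

# Reversing the row order leaves the plain MSE unchanged...
print(mse(targets, outputs), mse(targets[::-1], outputs[::-1]))
# ...but changes the weighted version, because row position now matters.
print(recency_weighted_mse(targets, outputs),
      recency_weighted_mse(targets[::-1], outputs[::-1]))
```

A measure of the first kind cannot tell whether rows were deleted or shuffled; one of the second kind can, but only through the weights.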
And before you ask, we will implement the Trends and Weighted series for classification and logistic regression because I see how they can be useful for the kind of datasets you’re dealing with.
Candida
"Trends series (4 fitness functions) and the Weighted series (9 functions). "
I have not used these yet, and probably would not have discovered them on my own. Those will be next on my list of exploration!!
By the way, I discovered the (new?) normalization features a couple days ago which I am LOVING !!!
"And before you ask"..
LOL - I have this uneasy feeling you are starting to use analytics to predict my questions!!
You are a wealth of great information Candida !! :>
I see the Weighted MSE etc. functions. Those make sense, as in the most recent data is more "important" than older data, I assume, like a weighted moving avg... but the Trends fitness functions... Trends with Punishment??? Defaulted to -10. Is there a simple high-level explanation for the "Punishment" property? My imagination is running rampant...
(You're probably thinking "Yes Devon, Punishment is having to respond to all your endless questions! " :)
Have no fear - I am running out of topics/questions to ask about!! :>
<< trends fitness functions... Trends with Punishment ??? Defaulted to -10..Is there a simple high level explanation for the "Punishment" property ?? My imagination is running rampant.>>
The Trends series has two trend components besides the error measure, which is the RRSE (root relative squared error). The first trend component rewards correct model movements (i.e. moves in the same direction as the target, up or down) and punishes wrong ones. The second component rewards correct transitions (up/down and down/up transitions) and punishes wrong ones. The choice of -10 for the punishment is a value that works well for most cases, but experiment away. These fitness functions are hard to fine-tune because of the number of components involved, and the error measure has to have a very heavy influence, otherwise nothing would work. So don’t expect it to just go and hug the curve (would have been nice though)! But it will try its best…
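For a feel of how such components could be computed, here is a rough sketch. The RRSE is the standard definition; the movement score is only an illustration of "reward correct moves, punish wrong ones". The reward of 1.0 is an assumption, only the -10 punishment comes from this thread, and how the tool actually combines the components is not shown here.

```python
from math import sqrt


def rrse(targets, outputs):
    """Root relative squared error: model error relative to the error of
    always predicting the target mean (0 is perfect, 1 matches the mean)."""
    mean_t = sum(targets) / len(targets)
    num = sum((y - t) ** 2 for t, y in zip(targets, outputs))
    den = sum((t - mean_t) ** 2 for t in targets)
    return sqrt(num / den)


def direction(prev, curr):
    """Return +1 for an up move, -1 for a down move, 0 for no change."""
    return (curr > prev) - (curr < prev)


def movement_score(targets, outputs, reward=1.0, punishment=-10.0):
    """Add `reward` for each step where model and target move in the same
    direction and `punishment` otherwise (illustrative scoring only)."""
    score = 0.0
    for i in range(1, len(targets)):
        same = (direction(targets[i - 1], targets[i])
                == direction(outputs[i - 1], outputs[i]))
        score += reward if same else punishment
    return score


targets = [1, 2, 3, 2]
outputs = [1, 3, 2, 1]  # the second step moves down while the target moves up
print(rrse(targets, outputs), movement_score(targets, outputs))
```

A transition component could be scored the same way by comparing where the direction flips (up/down or down/up) in the target and in the model output.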
Candida
@Candida - I can actually understand that!! Will be experimenting with these all weekend. Thx
Questions
When installing new updates, is it recommended to uninstall the original install first, or can I just "ride over it" with the new build?
Also, will it ever be possible for the Time Series run category to allow more than a single column/variable, or is it fixed to a single column due to the nature of the algo?
I'd love to be able to experiment with that model category with multiple input variables, especially for market data, where attributes like Volume are very important. Is there a clever workaround for doing this in the current live version until it becomes, if that's even possible, a built-in feature in a future version?
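One generic workaround I can think of, offered as a sketch and not as a confirmed feature of the tool, would be to build the lagged columns myself and load the result as an ordinary regression dataset. The helper name, the column names and the lag count below are all hypothetical.

```python
def lagged_rows(series_by_name, target_name, n_lags):
    """Turn several aligned series into a regression table.

    Each row holds the previous `n_lags` values of every series as inputs
    and the current value of `target_name` as the last column.
    """
    names = list(series_by_name)
    length = len(series_by_name[target_name])
    header = [f"{name}_lag{k}" for name in names
              for k in range(1, n_lags + 1)] + [target_name]
    rows = []
    for t in range(n_lags, length):
        row = [series_by_name[name][t - k] for name in names
               for k in range(1, n_lags + 1)]
        row.append(series_by_name[target_name][t])
        rows.append(row)
    return header, rows


# Hypothetical market data: closing price and volume, oldest first.
data = {"close": [1, 2, 3, 4], "volume": [10, 20, 30, 40]}
header, rows = lagged_rows(data, target_name="close", n_lags=2)
print(header)
for row in rows:
    print(row)
```

The first `n_lags` records are lost because they have no full history, and rows in such a table should not be deleted individually without rebuilding the lags from the cleaned series.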
