Outliers
  • I want to remove my outliers before training my model. In the attached screenshot, how do I modify this single point? I clicked in the upper grid thinking I could modify a single cell value by clicking on it, but either that's not an option or I'm doing something wrong. I know I can copy and paste into Excel and make the changes there, but then I have to reimport the file. Is there a consolidated workflow that anyone can recommend?

    Also, can anyone suggest a fairly painless way of preprocessing these outliers out of my source data files without having to modify cells one by one? I'd love to eventually automate this process if possible.

    What do the two pink horizontal lines in the chart represent (the bottom line has been cut off in this screenshot)? They look important :> I just don't see their corresponding values in the properties on the right side (min, max, avg, etc.). Do they represent some sort of standard deviation for that column?

     

    In addition, where it says POINT, can you explain how the two outlier drop-down options work (Outliers Error Perc and Outliers Abs Error), and what role the value input box plays for these two options?

     

     

    Thanks

    Devon

     

     

     

  • Hi Devon,


    The two pink lines are really important for detecting outliers :) They are the standard deviation lines; the default is 3 sigma, which is a simple method of detecting outliers: any points outside these lines are treated as outliers, and you should remove them if they are negatively affecting your modeling. Through the context menu you can choose to plot 1, 2 or 3 stdev lines. The yellow line is the average, which you can also switch on and off through the context menu.


    Now, to remove the outliers, you don’t have a fully automated process yet, but after you spot them using the charts, you then can delete them through the Delete Records Window (it’s accessible through the Data Menu). So it’s fairly simple and fast.


    <<In addition, where it says POINT, can you explain how the two outlier drop downs options work (Outliers Error Perc and Outliers Abs Error) and what role does the value input box provide to these two options ?>>


    These outliers refer to model output outliers (they are relative to the active model, so you can do interesting analyses with this). In the input box you write the error value you want. There’s also a tooltip over this input box that then tells you how many outliers fit that criterion; they then are shown in red in the charts (line charts and scatter plot).


    Candida
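
A minimal sketch of the 3 sigma rule described above, for anyone who wants to check a column outside the tool. It assumes the lines sit at the average plus and minus k standard deviations and uses the population standard deviation; whether GeneXproTools uses the population or the sample formula is not stated in this thread, so treat that as an assumption. The function names and the sample data are hypothetical.

```python
from statistics import mean, pstdev


def sigma_bounds(values, k=3.0):
    """Return (lower, upper) = average -/+ k standard deviations."""
    mu = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    return mu - k * sd, mu + k * sd


def outlier_ids(values, k=3.0):
    """Return the 1-based positions of values outside the k sigma band."""
    lower, upper = sigma_bounds(values, k)
    return [i for i, v in enumerate(values, start=1) if v < lower or v > upper]


# Hypothetical column: 19 ordinary records and one spike in record 20.
column = [10.0] * 19 + [100.0]
lower, upper = sigma_bounds(column)
flagged = outlier_ids(column)
print(lower, upper, flagged)
```

Passing `k=1` or `k=2` corresponds to the 1 and 2 stdev lines that can be plotted from the context menu.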

  • Thanks Candida!

    That is useful information.


    I did discover that I can delete rows, but it would seem more intuitive to be able to quickly modify a cell value as one would in an Excel file. Question: would deleting rows make any difference when modeling sequential time series data with any of the non-time-series functions (regression, classification, etc.)? Are these modeling types even "aware" of the previous or next rows within the sequence? If not, I guess deleting a row would have little impact.

      Maybe if enough of your users would also find the ability to edit individual cell values useful, we could add it to the future upgrades wish list?

     

    Devon Kyle

  • That’s already on the wish list. We’ll add support for record editing in v6.


    << Question: would deleting rows make any difference when modeling sequential time series data with any of the non-time-series functions (regression, classification, etc.)? Are these modeling types even "aware" of the previous or next rows within the sequence? If not, I guess deleting a row would have little impact.>>


    For classification and logistic regression, the answer is no. For regression the answer is also no for most fitness functions. But we now have new fitness functions for regression (and for time series prediction too, although deleting records is not supported in TSP) that explore the time sequence in simple ways: the Trends series (4 fitness functions) and the Weighted series (9 functions). Even these fitness functions will work fine if you randomize or delete the inputs, as the time component is just one of the terms.


    And before you ask, we will implement the Trends and Weighted series for classification and logistic regression because I see how they can be useful for the kind of datasets you’re dealing with.


    Candida

  • "Trends series (4 fitness functions) and the Weighted series (9 functions). "
    I have not used these yet, and probably would not have discovered them on my own. Those will be next on my list of things to explore!!

    By the way, I discovered the (new?) normalization features a couple days ago which I am LOVING !!!

     

    "And before you ask"..
    LOL - I have this uneasy feeling you are starting to use analytics to predict my questions !!

    You are a wealth of great information Candida  !!  :>

  • I see the Weighted MSE etc. functions. Those make sense: I assume the most recent data is more "important" than older data, like a weighted moving average. But the Trends fitness functions... Trends with Punishment??? It defaults to -10. Is there a simple high-level explanation for the "Punishment" property? My imagination is running rampant.


    (You're probably thinking  "Yes Devon, Punishment is having to respond to all your endless questions! " :)
    Have no fear - I am running out of topics/questions to ask about!! :>

  • Given that this is the second thread about deleting outliers, we created a new feature to help with this task. It is not yet the full blown delete outliers we have in mind for the next version but just a workaround to help you get rid of those pesky outliers.

    We added a new entry to the context menu of the Sequential Distribution Chart (Variables). This entry is called Copy Outlier IDs (3 Sigma) and, as the name indicates, it copies the IDs of the records that are outside the 3 Sigma boundaries. They are copied in the format expected by the Delete Records window, so removing outliers is as easy as copying them in the chart, opening the Delete Records window and pressing OK. It does this for one variable at a time.

    There is a new version for download at the site with this new feature and a few bug fixes.

    I hope you find this useful!
    Jose
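
For the earlier question about cleaning source files before import, the same 3 sigma rule can be applied to several columns at once outside the tool. This is only a sketch under stated assumptions: the data is a CSV with a header row, the population standard deviation is used, and a record is dropped if any of the chosen columns falls outside its band. The function names, the column names (Close, Volume) and the sample data are hypothetical and are not part of GeneXproTools.

```python
import csv
import io
from statistics import mean, pstdev


def split_three_sigma(rows, columns, k=3.0):
    """Split rows (dicts) into kept rows and 1-based IDs of dropped rows."""
    bounds = {}
    for col in columns:
        values = [float(r[col]) for r in rows]
        mu, sd = mean(values), pstdev(values)  # assumption: population stdev
        bounds[col] = (mu - k * sd, mu + k * sd)

    kept, dropped_ids = [], []
    for record_id, row in enumerate(rows, start=1):
        outside = any(
            not (bounds[c][0] <= float(row[c]) <= bounds[c][1]) for c in columns
        )
        if outside:
            dropped_ids.append(record_id)
        else:
            kept.append(row)
    return kept, dropped_ids


def clean_csv(src, dst, columns, k=3.0):
    """Read CSV from src, write the kept records to dst, return dropped IDs."""
    reader = csv.DictReader(src)
    rows = list(reader)
    kept, dropped_ids = split_three_sigma(rows, columns, k)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(kept)
    return dropped_ids


# Hypothetical data: 20 records, with a Volume spike in record 7.
lines = ["Close,Volume"]
for i in range(1, 21):
    volume = 50000 if i == 7 else 1000
    lines.append(f"{100 + i},{volume}")
source = io.StringIO("\n".join(lines) + "\n")
cleaned = io.StringIO()

dropped = clean_csv(source, cleaned, columns=["Close", "Volume"])
print(dropped)
```

With real files, `src` and `dst` would be opened with `open(path, newline="")`; the cleaned file can then be imported as usual.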

  • "pesky outliers" - that's a very polite way of characterizing them :>

     

    Will download and test this weekend Jose   - Tx !

  • << trends fitness functions... Trends with Punishment ??? Defaulted to -10..Is there a simple high level explanation for the "Punishment" property ?? My imagination is running rampant.>>


    The trends series have two trends components, besides the error measure which is the RRSE. The first trend component rewards correct model movements (i.e. moves in the same direction as the target, up or down) and punishes otherwise. The second component rewards correct transitions (up/down and down/up transitions) and punishes wrong ones. The choice of -10 for the punishment is a value that works well for most cases, but experiment away. These fitness functions are hard to fine tune because of the number of components involved, and the error measure has to have a very heavy influence, otherwise nothing would work. So don’t expect it to just go and hug the curve (would have been nice though)! But it will try its best…


    Candida
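
To make the first trend component a little more concrete, here is a rough sketch. The RRSE (root relative squared error) is computed in its standard form; the direction score below only illustrates the "reward correct movements, punish otherwise" idea. The reward of 1, the treatment of flat steps, and the function names are assumptions, and the way GeneXproTools actually combines these terms into a fitness value is not described in this thread.

```python
import math


def rrse(target, model):
    """Root relative squared error of the model against the target."""
    mean_t = sum(target) / len(target)
    squared_error = sum((m - t) ** 2 for t, m in zip(target, model))
    squared_dev = sum((t - mean_t) ** 2 for t in target)
    return math.sqrt(squared_error / squared_dev)


def direction_score(target, model, punishment=-10.0, reward=1.0):
    """Add reward when the model moves the same way as the target, else punishment.

    Assumption: a flat step in either series counts as a wrong movement.
    """
    score = 0.0
    for i in range(1, len(target)):
        target_move = target[i] - target[i - 1]
        model_move = model[i] - model[i - 1]
        score += reward if target_move * model_move > 0 else punishment
    return score


# Hypothetical series: the model misses one of the four movements.
target = [1.0, 2.0, 3.0, 2.0, 1.0]
model = [1.0, 2.0, 2.5, 3.0, 1.0]
error = rrse(target, model)
score = direction_score(target, model)
print(error, score)
```

Three correct movements and one wrong one give 3 * 1 + (-10) = -7 here, which shows how strongly a single wrong movement weighs with a punishment of -10.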

  • @Candida - I can actually understand that!! Will be experimenting with these all weekend. Thx

    Questions

    When installing new updates, is it recommended to uninstall the original install or can I just "ride over it" with the new build ?

     

    Also, will it ever be possible for the Time Series run category to allow more than a single column/variable, or is it fixed to a single column due to the nature of the algorithm?

    I'd love to be able to experiment with that model category using multiple input variables, especially for market data, where attributes like Volume are very important. Is there a clever workaround for doing this in the current live version until it becomes (if that's even possible) a built-in feature in a future version?

  • Devon,

    Unfortunately you will have to do the uninstall and re-install. "Riding over" is dangerous in that some file might not be replaced. Still, there is an easier way. Create a batch file (I named mine update.bat) and add these lines:

    CALL "C:\Program Files (x86)\GeneXproTools 50\unins000.exe" /SILENT
    PAUSE
    CALL GeneXproToolsSetup.exe /SILENT

    If your OS is 32-bit, replace Program Files (x86) with Program Files. Drop the batch file next to the installer and run it (double-clicking works). It uninstalls the current installation and then pauses (for some reason I don't recall, it fails without the pause). Hit Enter and it runs an unattended installation. This will help until we start doing automated updates (no promises though).

    Regarding the Time Series run category: for the moment there is no easy way to do multi-variable time series runs. The next version will see big changes in this area and, although it is too early to give details, this is on our list.

  • Glad I asked! Thanks Jose...
