Parsing Float [Java Regex]
type568

I'm trying to compile a pattern for a String that represents a floating-point number. I'm aware I can parse it with Float.parseFloat(), which I plan to do. But before I get to the parsing, I need to make sure the text is something that will parse successfully.

Here's what I'm doing:

public static final String P_PRICE = "(\\d+.\\d+?)";
public static final String P_CHANGE = "\\((-?\\d+.\\d+?)\\)";
public static final String P_UNS= "(\\d+)";

Pattern p = Pattern.compile(P_PRICE+" "+P_PRICE+" "+P_PRICE+" "+P_PRICE+" "+P_CHANGE+" "+P_UNS+" "+P_UNS+"."+P_UNS+"."+P_UNS);
        Matcher m=p.matcher(line);

It works fine, until:

Tue\ Oct\ 11\ 13\:33\:02\ EEST\ 2016\ Line0=DataManager.NewCandle.NewCandle\: Failed to parse line\:\n0.071 0.06959 0.07018 0.07015 (-5.8E-4) 7181160800 07.09.2016\n

The issue here is that "*(-5.8E-4)*" isn't caught by P_CHANGE pattern, although it's a representation of a float.

String str="-5.8E-4";
System.out.println(Float.parseFloat(str));

Does print the appropriate result.
I wasted quite some time and googled around, but I can't seem to glue together the pattern I need.

Now, this:

"(\-?\\d+.\\d+E\-?\\d+)"

Seems to scan the likes of -X.XE-X, but: I need to OR it somehow with the P_PRICE stuff I use. And I'm quite clueless :(

amarillion

How about something like:

"(-?\\d+.\\d+(E-?\\d+)?)"

In Java, prices are generally much better represented by BigDecimal than by float though.

A value like "0.3" can never be exactly represented by a float. That is because that value can't be represented exactly in binary form. Try this online converter to convert decimal 0.3 to binary to see what I mean:

http://www.binaryconvert.com/result_float.html?decimal=048046051

On the other hand, the rounding error would allow you to set up a scam like they proposed in Office Space :)
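To make the 0.3 point concrete, here is a minimal sketch that prints the exact value a float stores next to the BigDecimal equivalent. The class and method names are made up for illustration.

```java
import java.math.BigDecimal;

public class FloatVsBigDecimal {
    // Exact decimal expansion of the binary value a float actually stores.
    // The float is widened to double first, which loses nothing.
    static BigDecimal exactValue(float f) {
        return new BigDecimal(f);
    }

    public static void main(String[] args) {
        // The float nearest to 0.3 is slightly above it.
        System.out.println(exactValue(0.3f));      // 0.300000011920928955078125
        // Built from a String, BigDecimal holds exactly 0.3.
        System.out.println(new BigDecimal("0.3")); // 0.3
        // Decimal arithmetic stays exact as well.
        System.out.println(new BigDecimal("0.1").add(new BigDecimal("0.2"))); // 0.3
    }
}
```

For analysis work the float error is usually negligible, which is the trade-off discussed in the reply below.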

type568

TLDR
Move to append.

Thank you Soviet cosmonaut. :)

You're absolutely right about BigDecimal, but I'm not installing a platform to host any kind of trading, I'm just busy with analysis. I'm ready to sacrifice the precision for the sake of ease of coding, and faster performance. I plan to teach this thing using DNA algorithm later on.

Also I'm aware float sometimes behaves weirdly with output, now I know exactly why. Probably I did study it, but it didn't stay in the head. The 0.3 is an awesome example, and you gave a very good explanation. Thank you.

Now about the pattern. Unfortunately, it's more complex than your suggestion :(
It kind of works, but not exactly. Here's a chunk of code, it's quite big, but couldn't figure it any smaller.. It isn't part of my program, it's made just for this test:

Patterns, my old ones & yours.

    public static final String P_PRICE_NEW = "(-?\\d+.\\d+(E-?\\d+)?)";
    public static final String P_CHANGE_NEW = "\\("+P_PRICE_NEW+"\\)";
    
    public static final String P_PRICE = "(\\d+.\\d+?)"; //(-?\\d+.\\d+(E-?\\d+)?)
    public static final String P_CHANGE = "\\((-?\\d+.\\d+?)\\)";
    
    public static final String P_UNS= "(\\d+)";

The test code:

    String line="139.8";
    Pattern p = Pattern.compile(P_PRICE_NEW);
    Matcher m=p.matcher(line);

    if(m.find()){
        for(int i=0;i<m.groupCount();i++)
            System.out.println(m.group(i));
    }
    System.out.println("|||||||||||");

    line="139.8 138.18 138.98 139.14 (0.0) 32335800 11.09.2016";
    p = Pattern.compile(P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_CHANGE_NEW+" "+P_UNS+" "+P_UNS+"."+P_UNS+"."+P_UNS);
    m=p.matcher(line);

    if(m.find()){
        for(int i=0;i<m.groupCount();i++)
            System.out.println(m.group(i));
    }
    System.out.println("|||||||||||");


    line="139.8 138.18 138.98 139.14 (0.0) 32335800 11.09.2016";
    p = Pattern.compile(P_PRICE+" "+P_PRICE+" "+P_PRICE+" "+P_PRICE+" "+P_CHANGE+" "+P_UNS+" "+P_UNS+"."+P_UNS+"."+P_UNS);
    m=p.matcher(line);

    if(m.find()){
        for(int i=0;i<m.groupCount();i++)
            System.out.println(m.group(i));
    }

The output:

    139.8
    139.8
    |||||||||||
    139.8 138.18 138.98 139.14 (0.0) 32335800 11.09.2016
    139.8
    null
    138.18
    null
    138.98
    null
    139.14
    null
    0.0
    null
    32335800
    11
    09
    |||||||||||
    139.8 138.18 138.98 139.14 (0.0) 32335800 11.09.2016
    139.8
    138.18
    138.98
    139.14
    0.0
    32335800
    11
    09

Here we can see your pattern is good in the first line, but not good when we parse the real case.

While my old code does handle this stuff.

Append:
I figured the issue, but not how to solve it.
My line to print, didn't print it all:

if(m.find()){
            for(int i=0;i<m.groupCount();i++)
                   System.out.println(m.group(i));
        }

Is incorrect, correct is this:
for(int i=0;i<m.groupCount()+1;i++)
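The reason for the +1: Matcher.groupCount() reports the number of capturing groups and does not count group 0, which is the whole match. A small sketch of the inclusive loop follows; the class name, helper name and sample input are only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupLoopDemo {
    // Collects group 0 (the whole match) through group groupCount(), inclusive.
    static List<String> allGroups(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two capturing groups, so three entries: whole match, group 1, group 2.
        System.out.println(allGroups("(\\d+) (\\d+)", "11 09")); // [11 09, 11, 09]
    }
}
```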

Now I just know your pattern breeds an extra group, so this:

    public static final String P_PRICE_NEW = "(-?\\d+.\\d+(E-?\\d+)?)";

Isn't a solution to identify any float :(

Append1:
I upgraded your regex by turning the second group in to a "non grouping group" using a ?: .

Oh, & I also let it catch integers.. Int is also a parseable float I believe.
Although float output will use a .0, oh well..
So it goes like this:

    public static final String P_PRICE = "(-?\\d+(?:.\\d+)(?:E-?\\d+)?)";
    public static final String P_CHANGE = "\\("+P_PRICE+"\\)";
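A short sketch of what the ?: changes, comparing group counts for the capturing and non-capturing variants of the exponent group. The class and helper names are hypothetical, and both patterns leave the dot unescaped, as posted at this point in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonCapturingDemo {
    static final String CAPTURING = "(-?\\d+.\\d+(E-?\\d+)?)";
    static final String NON_CAPTURING = "(-?\\d+.\\d+(?:E-?\\d+)?)";

    // Number of capturing groups in the regex (group 0 is not counted).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // The requested group, or null if there is no match or the group did not take part in it.
    static String groupOf(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupCount(CAPTURING));              // 2
        System.out.println(groupCount(NON_CAPTURING));          // 1
        // The inner group is the source of the "null" lines: no exponent, no capture.
        System.out.println(groupOf(CAPTURING, "139.8", 2));     // null
        System.out.println(groupOf(CAPTURING, "-5.8E-4", 2));   // E-4
        System.out.println(groupOf(NON_CAPTURING, "-5.8E-4", 1)); // -5.8E-4
    }
}
```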

bamccaig

First you have to define every possible pattern that could be parsed as a float... This is no small undertaking... If possible, it's best left to the type author. You appear to want to defer to the standard implementation of Float.parseFloat(). I'm left wondering why you can't utilize exception handling for this purpose:

Float x = null;

try {
    x = Float.parseFloat(input);
} catch (Exception e) {
    /// What now? You decide or...
    x = 0f;
}

Note: I'm a rusty Java amateur. I mostly write C#. In any case, I imagine Java should be similar. What stops you from letting the Float class do its thing and relying on exception handling to catch its failures?

Failing that, isn't the Java source code open source? Look at how it does it for inspiration (or if its license is compatible, copy it).

type568

Generally, it's done (problem solved).
What you are suggesting isn't exactly a solution to my needs, as I want to grab a specific substring of a larger string, which is what I use the regex for. And I want to be sure all of the values I need are present in the String; if they aren't, it's not the String I want.

Generally it could be parsed somehow like "from space to space", which I did ages ago. Clearly regex is the way to go, as it offers a lot more elegant solution.

Generally my code related to the topic is this:

    public static final String P_PRICE = "(-?\\d+(?:.\\d+)(?:E-?\\d+)?)";
    public static final String P_CHANGE = "\\("+P_PRICE+"\\)";
    public static final String P_UNS= "(\\d+)";

    public NewCandle(String open,String high,String low,String close,String change,String volume,String day,String month,String year){
        this(Float.parseFloat(open),Float.parseFloat(high),Float.parseFloat(low),Float.parseFloat(close),Float.parseFloat(change),Long.parseLong(volume),Integer.parseInt(day),Integer.parseInt(month),Integer.parseInt(year));
    }

    public static NewCandle candleFromString(String line){
        Pattern p = Pattern.compile(P_PRICE+" "+P_PRICE+" "+P_PRICE+" "+P_PRICE+" "+P_CHANGE+" "+P_UNS+" "+P_UNS+"."+P_UNS+"."+P_UNS);
        Matcher m=p.matcher(line);

        if(m.find())
            return new NewCandle(m.group(1),m.group(2),m.group(3),m.group(4),m.group(5),m.group(6),m.group(7),m.group(8),m.group(9));
        else{
            GA.GA.reportError("DataManager.NewCandle.NewCandle: Failed to parse line:\n"+line);
            return new NewCandle(null,null,null,null,null,null,null,null,null);
        }
    }

Readable, and simple. Just the regex is something that requires some depth in it, but it's quite encapsulated.

bamccaig
type568 said:

public static final String P_PRICE = "(-?\\d+(?:.\\d+)(?:E-?\\d+)?)";

Unfortunately ML's markup tends to eat backslashes that aren't intended to be eaten. I'm assuming that dot (.) is prefixed with an escape (\)?

Otherwise, I agree, a regex is a good solution to this problem. If supported by Java's regex/string literal, you should consider using insignificant white-space/comments with the regex to explain it. It might be readable and make sense now, but in 6 months you'd be surprised. :P

type568

I'm already surprised BamBam, doesn't take six months :D

No, ML didn't eat anything which didn't belong to him.

All this: (?:.\\d+) is under a question, the ?: means previous char may not appear, and if it's an opening bracket all of its content may not appear. This is so that we could accept an integer as a float (as it's parseable by Float.parseFloat()).

If we're already in the second bracket, we don't question the dot, it must be here. And it doesn't require an escape character. The \\d+ kinda does, as it's not a d character, but a number of digits.

About whitespaces though.. Uhm. Where? If I add white spaces in to the pattern, it will be actually looking for whitespaces, and there won't be I'm afraid..

About comments. Well, go explain it :S
I'm unsure how to. Regex is regex, guess some word description could assist in understanding. IDK.

I read what I wrote, I understand it. If not clear I can clarify though. If you wanna bother understanding. :)

bamccaig

I'm basing this off of Perl 5 experience and cross-referencing Java documentation[1]. In Perl, (?:pattern) means pattern is a "non-capturing" group. It still has to match, but it isn't part of the "output" capture groups of the regex. The parens are used only to group the pattern for other reasons, like internal operators, or to apply an operator to the whole thing (e.g., ?). From the sounds of it, the same is true in Java. Similarly, with such a non-capture group in Perl the dot character (.) has no special significance: it still represents "any" character. It sounds as though the same is true in Java. Have you thoroughly tested your regex with all anticipated inputs and garbage? It's possible I'm misunderstanding the documentation or perhaps the regex library you're using is different than the one I'm reading about. Off the top of my head, the Perl regex for this solution would probably be:

    my $num = qr/[0-9]/;   # Numeral (in Perl \d includes more than just 0-9 [some of the time]).
    my $sign = qr/[+-]?/;  # Optional sign (remove + if desired).
    my $dot = qr/[.,]/;    # Either . or , (I understand some regions use comma?).

    my $re = qr/
        $sign
        (?:                 # "00" or "00." or "00.00":
            $num+           # Required numerals.
            (?:             # Optional decimal part:
                $dot        # Required dot.
                $num*       # Optional after-part.
            )?
        |                   # OR ".00": (NEW)
            (?:             # Just the decimal part:
                $dot        # Required dot.
                $num+       # Required numerals.
            )
        )
        (?:                 # Optional scientific e[xponent] notation:
            [eE]            # Character e (case-insensitive).
            $sign
            $num+           # Numerals.
        )?
    /x;

(Untested)

In Perl, qr// (quote-regex) is a way of storing a regular expression in a variable. The x modifier (/x) allows white-space and comments to be ignored within the pattern itself. If you actually intend for white-space you need to use \s or [ ] or \t, etc.

As you can see, this allows you to document each part of the regex and explain it. This comes in handy because as above where I've added support for ".000" (which may or may not meet your spec.) things tend to get complicated with regular expressions. I've opted to move a couple of repetitive concepts into a variable to avoid repeating myself. Arguably, in this case that might be more cryptic than repeating yourself, but it has some advantages for some complicated regexps.

Please explain if I'm mistaken. It never hurts to familiarize yourself with another regex engine. :) For the record, I personally think that everybody should learn regex in Perl to experience what it's really capable of. ;)

type568

Note: The first part of the post contains an error which was found during creation of the post, and is explained and corrected later.

The issue here is that I'm poorly familiar with regex; otherwise there wouldn't be this post. Nevertheless, I'm quite sure the dot is just a dot. It's not any char, it's just a dot. And only dot I want in this float, perhaps if I added it with an escape, it could be something special like "any char".

You're correct, I forgot to mention the (?:) is a non grouping group, but in addition to that, thanks to the ? it's a group that doesn't have to be there.

Here's some code instead of a thousand words:

public static final String P_PRICE_NEW = "(-?\\d+(?:.\\d+)(?:E-?\\d+)?)";

        String line="139.8E-2 124 -123.0";
        Pattern p = Pattern.compile(P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_PRICE_NEW);
        Matcher m=p.matcher(line);
        if(m.find()){
            for(int i=0;i<=m.groupCount();i++)
            System.out.println(m.group(i));
        }

& the output:

139.8E-2 124 -123.0
139.8E-2
124
-123.0

Generally the only thing that is a must here is a d+.

Hmm. I'll try to explain it actually, unsure how it's implemented I'll use my imagination. Good exercise anyways:

code: irrelevant apparently, as it'd be wrong :)

P.S:
After typing quite some stuff, and then actually testing it, I realized I was wrong. I'm unsure whether or not it's a good idea to delete my errors, as they're a good display of flight of thought. I'll just add a note to the beginning of the post. Thank you for your question & suggestion, as it forced me to learn the subject better. And well, it made me correct the pattern to be more precise, and not just "good enough".

Well, after writing this:

Quote:

Nevertheless, I'm quite sure the dot is just a dot. It's not any char, it's just a dot. And only dot I want in this float, perhaps if I added it with an escape, it could be something special like "any char".

I had to actually make sure it's as I say, right? Well, it isn't. :)
You're correct. But ML still didn't eat anything.

Oh, and you're right about non grouping group as well, that's all ?: does.

Here's the right code, with right explanation:

//////////////////////////////////////////01122223334444444336667777777660
public static final String P_PRICE_NEW = "(-?\\d+(?:\\.\\d+)?(?:E-?\\d+)?)";
    /*
    0- This is the main group, which includes it all.
    1- This allows the expression to either start or not start with a '-'
    2- This forces a larger than 0 amount of digits to be present.
    3- This creates a non-grouping group, which may or may not be, with 4 inside of it.
    4- If group 3 is created, inside of it has to be a '.', and a positive count of digits.
    6- Creates the final optional non grouping group,  with contents of 7 inside of it. 
    7- If created has to have an E, which maybe followed by a -, and then has 
    to be followed by a positive count of digits.
    */

This pattern also accepts above mentioned String, and produces the same output. But this one would filter away floats if they were to be written somehow like this: 12;45.

It'll also accept a single digit float I believe, while previous pattern would probably require at least two digits, and an any char between them(a digit would also do).
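A minimal sketch comparing the two patterns on a few inputs, to show what escaping the dot and making the fraction optional change. The class and method names are hypothetical; whole-string matches() is used here only to keep the comparison strict.

```java
import java.util.regex.Pattern;

public class EscapedDotDemo {
    // Previous pattern: unescaped dot (any char), fraction group required.
    static final Pattern OLD = Pattern.compile("(-?\\d+(?:.\\d+)(?:E-?\\d+)?)");
    // Corrected pattern: literal dot, fraction group optional.
    static final Pattern NEW = Pattern.compile("(-?\\d+(?:\\.\\d+)?(?:E-?\\d+)?)");

    static boolean oldAccepts(String s) { return OLD.matcher(s).matches(); }
    static boolean newAccepts(String s) { return NEW.matcher(s).matches(); }

    public static void main(String[] args) {
        String[] inputs = {"12;45", "4", "42", "124", "-123.0", "139.8E-2"};
        for (String s : inputs) {
            System.out.println(s + "  old=" + oldAccepts(s) + "  new=" + newAccepts(s));
        }
        // "12;45" is accepted only by the old pattern (';' matches the bare dot).
        // "4" and "42" are accepted only by the new one; the old one needs digit, any char, digit.
        // "124" passes the old one by accident: 1, then '2' as the any char, then 4.
    }
}
```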

Append:
Well, now to be completely sure I also tested this one:

String line="139.8E-2 124 -123.0 4";
Pattern p = Pattern.compile(P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_PRICE_NEW);

Groups as intended.

String line="139.8E-2 124 4.1.4 -123.0";
        Pattern p = Pattern.compile(P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_PRICE_NEW);

This doesn't find. By the way, I always use find() rather than matches(). So until the 4.1.4 was moved into the middle of the line, the pattern was still found & grouped, just dropping the redundant .4 after 4.1.

bamccaig

That looks good to me. 8-) Nevertheless, it's complicated enough that I recommend you setup a test program that feeds various good and bad inputs to it and verifies that it matches what it should and doesn't match what it shouldn't. That's really the best way to deal with regular expressions. Test them with enough data that if there's a bug it will get caught.
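Such a test program could be as small as a table of inputs with the expected verdict for each. The sketch below assumes the corrected pattern from the previous post and whole-string matching; the class name and the sample cases are made up for illustration. Note that this pattern is deliberately stricter than Float.parseFloat(), which also accepts forms like "1." and ".5".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class FloatPatternCheck {
    static final Pattern P_PRICE = Pattern.compile("(-?\\d+(?:\\.\\d+)?(?:E-?\\d+)?)");

    // Good and bad inputs with the expected verdict for each.
    static Map<String, Boolean> sampleCases() {
        Map<String, Boolean> cases = new LinkedHashMap<>();
        cases.put("0.071", true);
        cases.put("-5.8E-4", true);
        cases.put("139.8E-2", true);
        cases.put("124", true);
        cases.put("4", true);
        cases.put("12;45", false);
        cases.put("4.1.4", false);
        cases.put("1.", false);
        cases.put(".5", false);
        cases.put("1E", false);
        cases.put("abc", false);
        cases.put("", false);
        return cases;
    }

    // Counts the cases where the pattern disagrees with the expectation.
    static int failures(Map<String, Boolean> cases) {
        int bad = 0;
        for (Map.Entry<String, Boolean> c : cases.entrySet()) {
            boolean got = P_PRICE.matcher(c.getKey()).matches();
            if (got != c.getValue()) {
                System.out.println("MISMATCH: \"" + c.getKey() + "\" expected "
                        + c.getValue() + ", got " + got);
                bad++;
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println("failures: " + failures(sampleCases()));
    }
}
```

New edge cases go into the table as they turn up, so a later change to the regex gets checked against all of them at once.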

I also like your indexing scheme. I'm not sure if Java supports something like Perl's /x, but if not I think you've come up with a satisfactory way to document a regex otherwise. :)
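For what it's worth, java.util.regex does have a counterpart to /x: the Pattern.COMMENTS flag (also available inline as (?x)). In that mode whitespace in the pattern is ignored and # starts a comment that runs to the end of the line. A sketch of the same regex documented that way; the class name is hypothetical.

```java
import java.util.regex.Pattern;

public class CommentedPattern {
    // Each string piece ends in \n so that the # comment stops at the end of its line.
    static final Pattern P_PRICE = Pattern.compile(
          "(              # group 1: the whole number\n"
        + "  -?           # optional minus sign\n"
        + "  \\d+         # one or more digits\n"
        + "  (?:\\.\\d+)? # optional fraction: literal dot, then digits\n"
        + "  (?:E-?\\d+)? # optional exponent: E, optional minus, digits\n"
        + ")",
        Pattern.COMMENTS);

    static boolean accepts(String s) {
        return P_PRICE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("-5.8E-4")); // true
        System.out.println(accepts("12;45"));   // false
    }
}
```

As with /x in Perl, a space that is meant literally then has to be written as \\s, [ ] or an escaped space.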

Append:

Something like 4.1.4 shows you just how complicated precise regular expressions need to be. It's easy to match the basic cases. It's hard to match all of the good and none of the bad. Something like 4.1.4 might be solved with a negative lookahead added to the end. Something like: (?!\.\d*). Or it might suffice to require either white-space or the end of the line instead after the number: (?:\s+|$). Ultimately it all comes down to the business requirements. What input is possible and how you want to handle invalid input. In any case, I think you've got a good handle on things now.
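A rough sketch of the lookaround idea in Java, under one assumption that goes beyond the suggestion above: it guards both ends of the number, with a lookbehind as well as a lookahead, because find() would otherwise pick up the tail "1.4" of "4.1.4". The class name and method are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StandaloneNumbers {
    // The number must not be glued to another digit or dot on either side.
    static final Pattern NUMBER = Pattern.compile(
          "(?<![\\d.])"                        // not preceded by a digit or dot
        + "(-?\\d+(?:\\.\\d+)?(?:E-?\\d+)?)"   // the float pattern from the thread
        + "(?![\\d.])");                       // not followed by a digit or dot

    // Returns every standalone number found in the line, in order.
    static List<String> extract(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = NUMBER.matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // 4.1.4 is skipped entirely instead of being cut down to 4.1.
        System.out.println(extract("139.8E-2 124 4.1.4 -123.0")); // [139.8E-2, 124, -123.0]
    }
}
```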

type568

Again thanks for your suggestion, I don't think I'll bother though. It matches what it has to match(couple thousands lines are parsed just fine), and if it misses something it shouldn't let in.. Whatever. :)

Generally it should never be fed anything wrong anyways, and all it does if it's fed a non-matching string is crash. But the only strings fed to it are those produced with its .toString(). Generally I overshot the regex objective anyways, but it's good to learn something new.

About it missing the final 4.1.4, well. If I wanted it NOT to let this happen, I wouldn't use .find() on a matcher, but I would use .matches() instead. But actually, should I decide to append a comment to a line, well.. Why not. I also am not willing to mess with the various newlines which show up here and there and complicate the day. So .find() is good enough for me.
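The difference in a few lines: find() looks for the pattern anywhere in the input, while matches() requires the whole input to fit. A sketch with hypothetical helper names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindVsMatches {
    static final Pattern P_PRICE = Pattern.compile("(-?\\d+(?:\\.\\d+)?(?:E-?\\d+)?)");

    // First number found anywhere in the input, or null if there is none.
    static String firstFound(String input) {
        Matcher m = P_PRICE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // True only if the entire input is one number.
    static boolean wholeMatch(String input) {
        return P_PRICE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(firstFound("4.1.4"));            // 4.1
        System.out.println(wholeMatch("4.1.4"));            // false
        System.out.println(firstFound("4.1 some comment")); // 4.1
        System.out.println(wholeMatch("4.1"));              // true
    }
}
```

So find() tolerates trailing text such as a comment or a newline, at the price of also tolerating a trailing ".4".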

Now about data though. I tested this thing manually of course, but: do you mean I should feed it with something massive & generic in order to comply with today's coding standards, perhaps read from some file, or.. How would you suggest it be done?

Two more things:
Thanks for your admiration of my documentation, and about testing:
My program loads data from the net, then stores it locally.
Then it loads the data from files in to a separate instance of a class.
Then it compares string representation of the classes(which visually seem to be correct in the file).

Due to the fact all data matches, I do believe it works sufficiently well, even if it is possible to crash it, or cheat it if one messes with the files. If you ruin the data you can crash the program.. Fine for me, is there a program that won't crash then?
By the way. By default it saves two identical files representing the data, that is in case one was corrupted.

Edit: readability.

Thread #616530. Printed from Allegro.cc