|
Parsing Float [Java Regex] |
type568
Member #8,381
March 2007
|
I'm trying to compile a pattern for a String that represents a floating-point number. I'm aware I can parse it with Float.parseFloat(), which I do plan to do. But before I get to the parsing I need to make sure that what I'm about to parse will actually parse. Here's what I'm doing:

    public static final String P_PRICE = "(\\d+.\\d+?)";
    public static final String P_CHANGE = "\\((\\-?\\d+.\\d+?)\\)";
    public static final String P_UNS = "(\\d+)";

    Pattern p = Pattern.compile(P_PRICE+" "+P_PRICE+" "+P_PRICE+" "+P_PRICE+" "+P_CHANGE+" "+P_UNS+" "+P_UNS+"."+P_UNS+"."+P_UNS);
    Matcher m = p.matcher(line);

It works fine, until:

    Tue\ Oct\ 11\ 13\:33\:02\ EEST\ 2016\ Line0=DataManager.NewCandle.NewCandle\: Failed to parse line\:\n0.071 0.06959 0.07018 0.07015 (-5.8E-4) 7181160800 07.09.2016\n

The issue here is that "(-5.8E-4)" isn't caught by the P_CHANGE pattern, although it's a representation of a float. This:

    String str = "-5.8E-4";
    System.out.println(Float.parseFloat(str));

does print the appropriate result. Now, this:

    "(\\-?\\d+.\\d+E\\-?\\d+)"

seems to match the likes of -X.XE-X, but I need to OR it somehow with the P_PRICE stuff I use. And I'm quite clueless.
|
amarillion
Member #940
January 2001
|
How about something like:

    "(-?\\d+.\\d+(E-?\\d+)?)"

In Java, prices are generally much better represented by BigDecimal than by float though. A value like "0.3" can never be exactly represented by a float, because that value has no exact representation in binary. Try this online converter to convert decimal 0.3 to binary to see what I mean: http://www.binaryconvert.com/result_float.html?decimal=048046051

On the other hand, the rounding error would allow you to set up a scam like the one they proposed in Office Space. |
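The point about 0.3 can be checked directly in Java. This is only a small sketch; the class and method names are made up for illustration. It relies on the `BigDecimal(double)` constructor, which converts the stored binary value exactly instead of rounding it to a short decimal string.

```java
import java.math.BigDecimal;

final class FloatVsBigDecimal {
    /** Exact decimal expansion of the binary value a float actually stores. */
    static BigDecimal exactValueOf(float f) {
        // Widening float to double is exact, and new BigDecimal(double)
        // converts the binary value without any rounding.
        return new BigDecimal((double) f);
    }

    public static void main(String[] args) {
        // The nearest float to 0.3, which is slightly above 0.3.
        System.out.println(exactValueOf(0.3f));
        // A BigDecimal built from the string is exactly 0.3.
        System.out.println(new BigDecimal("0.3"));
        // 0.5 is a power of two, so the float stores it exactly.
        System.out.println(exactValueOf(0.5f));
    }
}
```

Values such as 0.5 or 0.25 survive unchanged because they are sums of powers of two; most decimal prices do not.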
type568
Member #8,381
March 2007
|
TLDR Thank you Soviet cosmonaut.

You're absolutely right about BigDecimal, but I'm not building a platform to host any kind of trading, I'm just busy with analysis. I'm ready to sacrifice precision for the sake of ease of coding and faster performance. I plan to teach this thing using a DNA algorithm later on. Also, I'm aware float sometimes behaves weirdly with output; now I know exactly why. I probably did study it, but it didn't stay in my head. The 0.3 is an awesome example, and you gave a very good explanation. Thank you.

Now about the pattern. Unfortunately, it's more complex than your suggestion. The patterns, my old ones & yours:

    public static final String P_PRICE_NEW = "(-?\\d+.\\d+(E-?\\d+)?)";
    public static final String P_CHANGE_NEW = "\\("+P_PRICE_NEW+"\\)";
    public static final String P_PRICE = "(\\d+.\\d+?)"; //(-?\\d+.\\d+(E-?\\d+)?)
    public static final String P_CHANGE = "\\((\\-?\\d+.\\d+?)\\)";
    public static final String P_UNS = "(\\d+)";

The test code:

    String line="139.8";
    Pattern p = Pattern.compile(P_PRICE_NEW);
    Matcher m=p.matcher(line);

    if(m.find()){
        for(int i=0;i<m.groupCount();i++)
            System.out.println(m.group(i));
    }
    System.out.println("|||||||||||");

    line="139.8 138.18 138.98 139.14 (0.0) 32335800 11.09.2016";
    p = Pattern.compile(P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_CHANGE_NEW+" "+P_UNS+" "+P_UNS+"."+P_UNS+"."+P_UNS);
    m=p.matcher(line);

    if(m.find()){
        for(int i=0;i<m.groupCount();i++)
            System.out.println(m.group(i));
    }
    System.out.println("|||||||||||");

    line="139.8 138.18 138.98 139.14 (0.0) 32335800 11.09.2016";
    p = Pattern.compile(P_PRICE+" "+P_PRICE+" "+P_PRICE+" "+P_PRICE+" "+P_CHANGE+" "+P_UNS+" "+P_UNS+"."+P_UNS+"."+P_UNS);
    m=p.matcher(line);

    if(m.find()){
        for(int i=0;i<m.groupCount();i++)
            System.out.println(m.group(i));
    }
The output:

    139.8
    139.8
    |||||||||||
    139.8 138.18 138.98 139.14 (0.0) 32335800 11.09.2016
    139.8
    null
    138.18
    null
    138.98
    null
    139.14
    null
    0.0
    null
    32335800
    11
    09
    |||||||||||
    139.8 138.18 138.98 139.14 (0.0) 32335800 11.09.2016
    139.8
    138.18
    138.98
    139.14
    0.0
    32335800
    11
    09
Here we can see your pattern is good in the first case, but not good when we parse the real case, while my old code does handle this stuff.

Append: this loop

    if(m.find()){
        for(int i=0;i<m.groupCount();i++)
            System.out.println(m.group(i));
    }

is incorrect, correct is this:

Now I just know your pattern breeds an extra group, so this:

    public static final String P_PRICE_NEW = "(-?\\d+.\\d+(E-?\\d+)?)";

isn't a solution to identify any float.

Append1: Oh, & I also let it catch integers. An int is also a parseable float, I believe.

    public static final String P_PRICE = "(-?\\d+(?:.\\d+)(?:E-?\\d+)?)";
    public static final String P_CHANGE = "\\("+P_PRICE+"\\)";
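The two observations in this post (the loop bound and the "extra group") can be sketched as follows. It assumes the intended loop fix is `i <= m.groupCount()`, which fits the output above where the year never gets printed: `Matcher.groupCount()` does not count group 0, so the last valid index is `groupCount()` itself. The class and helper names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    // Inner (E...) group captures, so every price contributes two groups.
    static final String CAPTURING = "(-?\\d+.\\d+(E-?\\d+)?)";
    // Inner groups are non-capturing (?:...), so every price contributes one group.
    static final String NON_CAPTURING = "(-?\\d+(?:.\\d+)(?:E-?\\d+)?)";

    /** Returns groups 1..groupCount() of the first match, or an empty list if none. */
    static List<String> groups(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.find()) {
            // Note the <=: group indices run from 0 (whole match) to groupCount().
            for (int i = 1; i <= m.groupCount(); i++) {
                out.add(m.group(i)); // null when an optional group did not take part
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "139.8 138.18";
        System.out.println(groups(CAPTURING + " " + CAPTURING, line));
        System.out.println(groups(NON_CAPTURING + " " + NON_CAPTURING, line));
    }
}
```

With the capturing variant every price is followed by a `null` for the exponent group that did not match; the non-capturing variant yields just the two prices.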
|
bamccaig
Member #7,536
July 2006
|
First you have to define every possible pattern that could be parsed as a float... This is no small undertaking... If possible, it's best left to the type author. You appear to want to defer to the standard implementation of Float.parseFloat(). I'm left wondering why you can't utilize exception handling for this purpose:

    Float x = null;
    try {
        x = Float.parseFloat(input);
    } catch (Exception e) {
        // What now? You decide or...
        x = 0f;
    }

Note: I'm a rusty Java amateur. I mostly write C#. In any case, I imagine Java should be similar. What stops you from letting the Float class do its thing and relying on exception handling to catch its failures? Failing that, isn't the Java source code open source? Look at how it does it for inspiration (or if its license is compatible, copy it).

-- acc.js | al4anim - Allegro 4 Animation library | Allegro 5 VS/NuGet Guide | Allegro.cc Mockup | Allegro.cc <code> Tag | Allegro 4 Timer Example (w/ Semaphores) | Allegro 5 "Winpkg" (MSVC readme) | Bambot | Blog | C++ STL Container Flowchart | Castopulence Software | Check Return Values | Derail? | Is This A Discussion? Flow Chart | Filesystem Hierarchy Standard | Clean Code Talks - Global State and Singletons | How To Use Header Files | GNU/Linux (Debian, Fedora, Gentoo) | rot (rot13, rot47, rotN) | Streaming |
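A slightly more defensive variant of the exception-handling idea, as a sketch only: the `SafeFloat` class and `tryParse` name are hypothetical, and returning an `Optional` instead of a default of 0 is a design choice made here, not something from the thread. `Float.parseFloat()` throws `NumberFormatException` for unparsable text and `NullPointerException` for `null`.

```java
import java.util.Optional;

final class SafeFloat {
    /** Parses s as a float, or returns an empty Optional if it cannot be parsed. */
    static Optional<Float> tryParse(String s) {
        if (s == null) {
            return Optional.empty(); // parseFloat(null) would throw NullPointerException
        }
        try {
            return Optional.of(Float.parseFloat(s));
        } catch (NumberFormatException e) {
            return Optional.empty(); // caller decides what "unparsable" means
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("-5.8E-4").isPresent()); // true
        System.out.println(tryParse("(0.0)").isPresent());   // false: parentheses are not part of a float
    }
}
```

This validates a single token; it does not split a whole candle line into fields, which is the part the regex handles in this thread.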
type568
Member #8,381
March 2007
|
Generally, it's done (problem solved). It could also be parsed somehow like "from space to space", which I did ages ago. Clearly regex is the way to go, as it offers a much more elegant solution. My code related to the topic is this:

    public static final String P_PRICE = "(-?\\d+(?:.\\d+)(?:E-?\\d+)?)";
    public static final String P_CHANGE = "\\("+P_PRICE+"\\)";
    public static final String P_UNS = "(\\d+)";

    public NewCandle(String open, String high, String low, String close, String change,
                     String volume, String day, String month, String year){
        this(Float.parseFloat(open), Float.parseFloat(high), Float.parseFloat(low),
             Float.parseFloat(close), Float.parseFloat(change), Long.parseLong(volume),
             Integer.parseInt(day), Integer.parseInt(month), Integer.parseInt(year));
    }

    public static NewCandle candleFromString(String line){
        Pattern p = Pattern.compile(P_PRICE+" "+P_PRICE+" "+P_PRICE+" "+P_PRICE+" "+P_CHANGE+" "+P_UNS+" "+P_UNS+"."+P_UNS+"."+P_UNS);
        Matcher m = p.matcher(line);

        if(m.find())
            return new NewCandle(m.group(1),m.group(2),m.group(3),m.group(4),m.group(5),m.group(6),m.group(7),m.group(8),m.group(9));
        else{
            GA.GA.reportError("DataManager.NewCandle.NewCandle: Failed to parse line:\n"+line);
            // Float.parseFloat(null) throws, so a non-matching line crashes here.
            return new NewCandle(null,null,null,null,null,null,null,null,null);
        }
    }

Readable, and simple. Only the regex requires some depth, but it's quite encapsulated.
|
bamccaig
Member #7,536
July 2006
|
type568 said:

    public static final String P_PRICE = "(-?\\d+(?:.\\d+)(?:E-?\\d+)?)";

Unfortunately ML's markup tends to eat backslashes that aren't intended to be eaten. I'm assuming that the dot (.) is prefixed with an escape (\)? Otherwise, I agree, a regex is a good solution to this problem. If supported by Java's regex/string literals, you should consider using insignificant white-space/comments within the regex to explain it. It might be readable and make sense now, but in 6 months you'd be surprised.
|
type568
Member #8,381
March 2007
|
I'm already surprised BamBam, doesn't take six months.

No, ML didn't eat anything which didn't belong to him. All of this: (?:.\\d+) is under a question; the ?: means the previous char may not appear, and if it's an opening bracket, all of its content may not appear. This is so that we can accept an integer as a float (as it's parseable by Float.parseFloat()). If we're already inside the second bracket, we don't question the dot, it must be there. And it doesn't require an escape character. The \\d+ kinda does, as it's not a d character, but a number of digits.

About whitespace though.. Uhm. Where? If I add whitespace into the pattern, it will actually be looking for whitespace, and there won't be any, I'm afraid.

About comments. Well, go explain it :S I read what I wrote, I understand it. If it's not clear I can clarify though, if you wanna bother understanding.
|
bamccaig
Member #7,536
July 2006
|
I'm basing this off of Perl 5 experience and cross-referencing Java documentation[1]. In Perl, (?:pattern) means pattern is a "non-capturing" group. It still has to match, but it isn't part of the "output" capture groups of the regex. The parens are used only to group the pattern for other reasons, like internal operators, or to apply an operator to the whole thing (e.g., ?). From the sounds of it, the same is true in Java.

Similarly, within such a non-capture group in Perl the dot character (.) gets no special treatment: it still represents "any" character. It sounds as though the same is true in Java. Have you thoroughly tested your regex with all anticipated inputs and garbage? It's possible I'm misunderstanding the documentation or perhaps the regex library you're using is different from the one I'm reading about.

Off the top of my head, the Perl regex for this solution would probably be:

    my $num = qr/[0-9]/;  # Numeral (in Perl \d includes more than just 0-9 [some of the time]).
    my $sign = qr/[+-]?/; # Optional sign (remove + if desired).
    my $dot = qr/[.,]/;   # Either . or , (I understand some regions use comma?).

    my $re = qr/
        $sign
        (?:             # "00" or "00." or "00.00":
            $num+       # Required numerals.
            (?:         # Optional decimal part:
                $dot    # Required dot.
                $num*   # Optional after-part.
            )?
        |               # OR ".00": (NEW)
            (?:         # Just the decimal part:
                $dot    # Required dot.
                $num+   # Required numerals.
            )           # Not optional, or the whole number could be empty.
        )
        (?:             # Optional scientific e[xponent] notation:
            [eE]        # Character e (case-insensitive).
            $sign
            $num+       # Numerals.
        )?
    /x;

(Untested)

In Perl, qr// (quote-regex) is a way of storing a regular expression in a variable. The x modifier (/x) allows white-space and comments to be ignored within the pattern itself. If you actually intend to match white-space you need to use \s or [ ] or \t, etc. As you can see, this allows you to document each part of the regex and explain it. This comes in handy because, as above where I've added support for ".000" (which may or may not meet your spec.), things tend to get complicated with regular expressions. I've opted to move a couple of repetitive concepts into variables to avoid repeating myself. Arguably, in this case that might be more cryptic than repeating yourself, but it has some advantages for complicated regexps.

Please explain if I'm mistaken. It never hurts to familiarize yourself with another regex engine. For the record, I personally think that everybody should learn regex in Perl to experience what it's really capable of.
|
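For comparison, Java's `java.util.regex` has a counterpart to Perl's /x: the `Pattern.COMMENTS` flag (or the inline `(?x)`), which ignores whitespace in the pattern and treats `#` to end of line as a comment. Below is a rough Java port of the idea, not a line-for-line translation. The class name is made up, and it only accepts `.` as the decimal separator, since `Float.parseFloat()` does not accept a comma.

```java
import java.util.regex.Pattern;

final class CommentedFloatRegex {
    // Each fragment ends in \n so that a # comment stops at the end of its own line.
    static final Pattern FLOAT = Pattern.compile(
          "[+-]?                     # optional sign\n"
        + "(?:\n"
        + "    [0-9]+                # integer part\n"
        + "    (?: \\. [0-9]* )?     # optional dot plus optional fraction digits\n"
        + "  |\n"
        + "    \\. [0-9]+            # or a bare fraction with a leading dot\n"
        + ")\n"
        + "(?: [eE] [+-]? [0-9]+ )?  # optional exponent\n",
        Pattern.COMMENTS);

    /** True if the whole string is a decimal number in the shape described above. */
    static boolean isFloat(String s) {
        return FLOAT.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"-5.8E-4", "124", ".5", "12.", "4.1.4", "abc"}) {
            System.out.println(s + " -> " + isFloat(s));
        }
    }
}
```

As in Perl, a literal space inside such a pattern has to be written explicitly, for example as `\\s` or `\\x20`.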
type568
Member #8,381
March 2007
|
Note: The first part of the post contains an error which was found while writing the post, and is explained and corrected later.

The issue here is that I'm poorly familiar with regex; otherwise there wouldn't be this post. Nevertheless, I'm quite sure the dot is just a dot. It's not any char, it's just a dot. And only a dot is what I want in this float; perhaps if I added it with an escape, it could be something special like "any char". You're correct, I forgot to mention that (?:) is a non-grouping group, but in addition to that, thanks to the ?, it's a group that doesn't have to be there. Here's some code instead of a thousand words:

& the output:

    139.8E-2 124 -123.0 139.8E-2 124 -123.0

Generally the only thing that is a must here is a d+. Hmm. I'll try to explain it actually; unsure how it's implemented, I'll use my imagination. Good exercise anyways:

code: irrelevant apparently, as it'd be wrong :)

P.S: Well, after writing this:

Quote: Nevertheless, I'm quite sure the dot is just a dot. It's not any char, it's just a dot. And only a dot is what I want in this float; perhaps if I added it with an escape, it could be something special like "any char".

I had to actually make sure it's as I say, right? Well, it isn't. Oh, and you're right about the non-grouping group as well, that's all ?: does. Here's the right code, with the right explanation:

    //////////////////////////////////////////01122223334444444336667777777660
    public static final String P_PRICE_NEW = "(-?\\d+(?:\\.\\d+)?(?:E-?\\d+)?)";
    /*
    0- This is the main group, which includes it all.
    1- This allows the expression to either start or not start with a '-'.
    2- This forces a larger than 0 amount of digits to be present.
    3- This creates a non-grouping group, which may or may not be there, with 4 inside of it.
    4- If group 3 is created, inside of it there has to be a '.', and a positive count of digits.
    6- Creates the final optional non-grouping group, with the contents of 7 inside of it.
    7- If created, it has to have an E, which may be followed by a -, and then has to be followed by a positive count of digits.
    */

This pattern also accepts the above-mentioned String, and produces the same output. But this one would filter away floats if they were written somehow like this: 12;45. It'll also accept a single-digit float, I believe, while the previous pattern would probably require at least two digits, and any char between them (a digit would also do).

Append:

    String line="139.8E-2 124 -123.0 4";
    Pattern p = Pattern.compile(P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_PRICE_NEW);

Groups as intended.

    String line="139.8E-2 124 4.1.4 -123.0";
    Pattern p = Pattern.compile(P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_PRICE_NEW+" "+P_PRICE_NEW);

This doesn't find. By the way, I always use find rather than match. So until I moved the 4.1.4 into the middle of the line, the pattern was still found & grouped, just dumping the redundant .4 after 4.1.
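The difference between the two patterns from this post can be checked with a few whole-string matches. This is a small sketch with made-up class and constant names: `UNESCAPED` is the earlier pattern (bare dot, required middle group) and `ESCAPED` is the corrected one.

```java
import java.util.regex.Pattern;

final class DotDemo {
    // Earlier pattern: '.' matches any character and the middle group is required.
    static final String UNESCAPED = "(-?\\d+(?:.\\d+)(?:E-?\\d+)?)";
    // Corrected pattern: '\\.' is a literal dot and the fraction is optional.
    static final String ESCAPED = "(-?\\d+(?:\\.\\d+)?(?:E-?\\d+)?)";

    /** True if the regex matches the entire input, not just a part of it. */
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"139.8E-2", "124", "12;45", "4"}) {
            System.out.println(s + "  unescaped=" + wholeMatch(UNESCAPED, s)
                    + "  escaped=" + wholeMatch(ESCAPED, s));
        }
    }
}
```

"124" passes the earlier pattern only by accident: the first `\\d+` gives back digits so that the bare dot can consume the "2". That is also why it needs at least three characters and rejects a lone "4".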
|
bamccaig
Member #7,536
July 2006
|
That looks good to me. Nevertheless, it's complicated enough that I recommend you set up a test program that feeds various good and bad inputs to it and verifies that it matches what it should and doesn't match what it shouldn't. That's really the best way to deal with regular expressions. Test them with enough data that if there's a bug it will get caught.

I also like your indexing scheme. I'm not sure if Java supports something like Perl's /x, but if not I think you've come up with a satisfactory way to document a regex otherwise.

Append: Something like 4.1.4 shows you just how complicated precise regular expressions need to be. It's easy to match the basic cases. It's hard to match all of the good and none of the bad. Something like 4.1.4 might be solved with a negative lookahead added to the end. Something like: (?!\.\d*). Or it might suffice to require either white-space or the end of the line after the number instead: (?:\s+|$). Ultimately it all comes down to the business requirements: what input is possible and how you want to handle invalid input. In any case, I think you've got a good handle on things now.
|
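A test program of the kind suggested here does not need a framework. The sketch below keeps two hand-written lists and reports every input the pattern gets wrong. The class name and the choice of sample inputs are assumptions made for the example, and the pattern is the corrected P_PRICE_NEW from earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class PriceRegexSelfCheck {
    static final Pattern PRICE = Pattern.compile("(-?\\d+(?:\\.\\d+)?(?:E-?\\d+)?)");

    static final String[] SHOULD_MATCH = {"0.071", "-5.8E-4", "124", "4", "139.8E-2", "-123.0"};
    // "1." and ".5" are accepted by Float.parseFloat(), but this pattern rejects them.
    static final String[] SHOULD_NOT_MATCH = {"", "abc", "12;45", "4.1.4", "E4", "--1", "1.", ".5"};

    /** Returns a description of every input the pattern classified wrongly. */
    static List<String> failures() {
        List<String> bad = new ArrayList<>();
        for (String s : SHOULD_MATCH) {
            if (!PRICE.matcher(s).matches()) bad.add("should match: " + s);
        }
        for (String s : SHOULD_NOT_MATCH) {
            if (PRICE.matcher(s).matches()) bad.add("should not match: " + s);
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> bad = failures();
        System.out.println(bad.isEmpty() ? "all cases pass" : String.join("\n", bad));
    }
}
```

The lists are the part worth maintaining: each time a real line fails to parse, adding it here keeps the fix from regressing.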
type568
Member #8,381
March 2007
|
Again, thanks for your suggestion; I don't think I'll bother though. It matches what it has to match (a couple of thousand lines are parsed just fine), and if it lets in something it shouldn't.. Whatever. Generally it should never be fed anything wrong anyway, and all it does if it's fed a non-matching string is crash. But the only strings fed to it are those produced by its own .toString(). Generally I overshot the regex objective anyway, but it's good to learn something new.

About it missing the final 4.1.4, well. If I wanted it NOT to let this happen, I wouldn't use .find() on a matcher, I would use .matches() instead. But should I actually decide to append a comment to a line, well.. Why not. I'm also not willing to mess with the various newlines which show up here and there and complicate the day. So .find() is good enough for me.

Now about data though. I tested this thing manually of course, but: do you mean I should feed it with something massive & generic in order to comply with today's coding standards, perhaps read from some file, or.. How would you suggest it be done?

Two more things: Due to the fact that all the data matches, I do believe it works sufficiently well, even if it is possible to crash it, or cheat it if one messes with the files. If you ruin the data you can crash the program.. Fine by me; is there a program that won't crash then?

Edit: readability.
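The find() versus matches() trade-off described in this post, as a short sketch. The class and method names are invented; the pattern is four copies of the corrected price pattern separated by single spaces, as in the earlier test lines.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindVsMatches {
    static final String P = "(-?\\d+(?:\\.\\d+)?(?:E-?\\d+)?)";
    static final Pattern FOUR = Pattern.compile(P + " " + P + " " + P + " " + P);

    /** find(): the pattern may match anywhere, and trailing text is ignored. */
    static boolean found(String line) {
        return FOUR.matcher(line).find();
    }

    /** matches(): the pattern must consume the entire line. */
    static boolean matchesWhole(String line) {
        return FOUR.matcher(line).matches();
    }

    /** Fourth captured price of the first find(), or null if nothing is found. */
    static String fourth(String line) {
        Matcher m = FOUR.matcher(line);
        return m.find() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        String tail = "139.8E-2 124 -123.0 4.1.4";
        System.out.println(found(tail));        // true
        System.out.println(fourth(tail));       // 4.1 (the trailing .4 is dropped)
        System.out.println(matchesWhole(tail)); // false
    }
}
```

So find() tolerates an appended comment or stray newline at the cost of silently truncating input like 4.1.4, while matches() rejects both.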
|
|