Discussion:
Valid characters in a module name
Ess Kay
2017-01-01 01:43:28 UTC
"JPMS: Modules in the Java Language and JVM" at
http://cr.openjdk.java.net/~mr/jigsaw/spec/lang-vm.html#jigsaw-2.6 says
that a module name can contain almost any character. If this is indeed the
case then it is going to make the development of scripting languages that
specify module names quite complicated.

Would it be too big a compromise to specify that a module name must not
contain
1) double quotes,
2) single quotes,
3) spaces,
4) forward slashes or
5) asterisks?
Put another way, what advantage is there in allowing a module name to start
with or contain a double quote or space etc?

Taking a step backwards, would it be too much of a compromise to specify
that module names should only contain characters that are valid in a Java
class name perhaps with the exception of a few special characters?
Remi Forax
2017-01-01 13:17:29 UTC
Hi,

Just to be sure, you are talking about module names for the VM and not module names for Java the language?
As usual, module names in Java source are more restricted than module names that can be generated by something other than javac.

Section 2.1 contains more restrictions on module names than the ones you have quoted,
http://cr.openjdk.java.net/~mr/jigsaw/spec/lang-vm.html#jigsaw-2.1
so spaces are not allowed.

The idea is that if we want an existing module system, not necessarily defined in Java, to be mapped to the JPMS environment, there is no reason to add arbitrary rules like the ones you propose for quotes.
For spaces, I fully agree with you: debugging a module configuration that allows ' ' or '\t' in a module name would be awful, which is why they are not allowed.

regards,
Rémi

Ess Kay
2017-01-02 01:44:36 UTC
Hi Rémi,
Post by Remi Forax
You can update your tool to use an escape character
Sure. However, can you imagine how much work it would be to update a Java
source parser to allow identifiers like package and class names to contain
escaped semi-colons, single quotes or double quotes? My scenario and that
of many others is the same. It can be done but the result will be ugly.

I repeat my earlier question: are there existing module systems out there
that allow spaces, quotes and semicolons to appear in a module name?

All I ask is that serious thought be given to how much flexibility is
really needed in a module name. There are signs that there has not yet
been much serious thought. For example, backspace is not allowed but DEL
(0x7F) is allowed.

Best Regards,
Alan Bateman
2017-01-02 07:29:39 UTC
Post by Ess Kay
Sure. However, can you imagine how much work it would be to update a Java
source parser to allow identifiers like package and class names to contain
escaped semi-colons, single quotes or double quotes? My scenario and that
of many others is the same. It can be done but the result will be ugly.
Can you clarify if your questions relate to module names in source files
or in the binary form (in a module-info.class file)?
Post by Ess Kay
I repeat my earlier question: are there existing module systems out there
that allow spaces, quotes and semicolons to appear in a module name?
All I ask is that serious thought be given to how much flexibility is
really needed in a module name. There are signs that there has not yet
been much serious thought. For example, backspace is not allowed but DEL
(0x7F) is allowed.
The #ModuleNameCharacters thread on the jpms-spec-experts list [1] has
the discussions and proposal on this topic.

-Alan

[1] http://mail.openjdk.java.net/pipermail/jpms-spec-experts/
David M. Lloyd
2017-01-03 16:04:48 UTC
Post by Ess Kay
Hi Rémi,
You can update your tool to use an escape character
Sure. However, can you imagine how much work it would be to update a Java
source parser to allow identifiers like package and class names to contain
escaped semi-colons, single quotes or double quotes? My scenario and that
of many others is the same. It can be done but the result will be ugly.
I repeat my earlier question, are there existing module systems out there
that allow spaces, quotes and semicolons to appear in a module name?
Yes. Java EE and JBoss Modules both allow this, as do systems where a
file name is a module name.
Post by Ess Kay
All I ask is that serious thought be given to how much flexibility is
really needed in a module name. There are signs that there has not yet
been much serious thought. For example, backspace is not allowed but DEL
(0x7F) is allowed.
There has been plenty of serious thought, and I agree that we should be
disallowing all Unicode controls of any kind, but my understanding is
that there are implementation complexities involved which make this
somehow impractical. However UTF-8 parsing is not difficult so
hopefully this can be revisited at some point.
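As a rough illustration of what "disallowing all Unicode controls" could look like in a tool, the sketch below relies on Character.isISOControl, which covers U+0000 to U+001F and U+007F to U+009F, so DEL is caught along with backspace. The class and method names are hypothetical, and this is not the rule that the JPMS specification defines.

```java
// Hypothetical helper; not part of the JDK or of the JPMS specification.
final class ControlCharCheck {

    /** Returns the index of the first ISO control character in name, or -1 if there is none. */
    static int firstControlChar(String name) {
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            // isISOControl is true for U+0000..U+001F and U+007F..U+009F.
            if (Character.isISOControl(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstControlChar("java.base"));     // no control characters
        System.out.println(firstControlChar("bad\bname"));     // contains backspace (U+0008)
        System.out.println(firstControlChar("bad\u007Fname")); // contains DEL (U+007F)
    }
}
```

A check of this kind treats backspace and DEL alike, which is the consistency being asked for in this part of the thread.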
--
- DML
Ess Kay
2017-01-03 21:38:12 UTC
Post by David M. Lloyd
Java EE and JBoss Modules both allow this
Two points. Firstly there is a big difference between a character being
allowed and the character actually being used in practice. Are you saying
that in practice anyone anywhere is putting spaces, single quotes or double
quotes within a Java EE and JBoss module name? If the answer is yes then
how common would that be?

Also, do Java EE and JBoss module names currently allow the characters in
the range 0x00 to 0x1F? If the answer is yes then 100% compatibility with
Java 9 module names is already gone.

The second point is that we are now talking about Java 9 module names being
embedded as identifiers within Java class files where they will directly
affect downstream users. This was not the case with Java EE and JBoss
Module names. This is a much, much bigger deal.
Post by David M. Lloyd
There has been plenty of serious thought
In the jpms-spec-experts list it is suggested that "for sanity" Java 9
module names should not contain "any character whose Unicode code point is
less than 0x20". Yet the DEL (0x7F) character is allowed?

Taking a step backwards, it would appear that it was never considered that
Java 9 module names might need to be specified as identifiers in existing
bytecode processing scripting languages. That is 100% understandable.
However, now that it is known that that is the case, doesn't it make sense
"for sanity" to not allow a Java 9 module name to contain spaces, single
quotes, double quotes, semicolons or asterisks? The problem with spaces,
single quotes and double quotes in an identifier that needs to be parsed from
a text file is obvious. As mentioned earlier, the problem with semicolons
is that they are commonly used as terminators in the scripting languages,
which nearly always use a Java style syntax. The problem with asterisks is
that they are commonly used as a wildcard character in identifiers in those
languages.
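To make the semicolon problem concrete, here is a minimal sketch of a hypothetical Java-style script format in which every statement ends with ';'. The "keep module" directive and the class name are invented for illustration and do not belong to any real tool.

```java
// Hypothetical script format: statements such as "keep module <name>;".
final class NaiveScriptSplit {

    /** Splits a script into statements the naive way, on every semicolon. */
    static String[] statements(String script) {
        return script.split(";");
    }

    public static void main(String[] args) {
        // A conventional module name yields exactly one statement.
        System.out.println(String.join(" | ", statements("keep module com.example.app;")));

        // A module name that itself contains ';' is cut into two fragments,
        // so the parser can no longer tell where the name ends.
        System.out.println(String.join(" | ", statements("keep module com.example;app;")));
    }
}
```

The same kind of breakage would apply to a wildcard matcher that treats '*' inside a name as "match anything".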
David M. Lloyd
2017-01-03 22:13:19 UTC
Post by Ess Kay
Post by David M. Lloyd
Java EE and JBoss Modules both allow this
Two points. Firstly there is a big difference between a character being
allowed and the character actually being used in practice. Are you
saying that in practice anyone anywhere is putting spaces, single quotes
or double quotes within a Java EE and JBoss module name? If the answer
is yes then how common would that be?
I don't have metrics, of course; one only needs to know the API contract
to know that this is allowed. But it is completely irrelevant to the
discussion of the requirement at any rate because you're confusing a few
things which I'll outline down below.
Post by Ess Kay
Also, do Java EE and JBoss module names currently allow the characters
in the range 0x00 to 0x1F? If the answer is yes then 100% compatibility
with Java 9 module names is already gone.
Yes it's allowed (not 0x00 but the others are), but I don't think there
is any practical way to actually accomplish injecting most of those
values (at least in a Java EE situation), nor have I ever seen it happen
in practice.
Post by Ess Kay
The second point is that we are now talking about Java 9 module names
being embedded as identifiers within Java class files where they will
directly affect downstream users. This was not the case with Java EE
and JBoss Module names. This is a much, much bigger deal.
No, that's not how it works at all. Java source modules can only
reference other Java source modules; nobody is going to be shocked that
you can't reference a manually-built module from a Java source module.
There's nothing to affect downstream users. Only container code will be
creating or referencing such modules.
Post by Ess Kay
Post by David M. Lloyd
There has been plenty of serious thought
In the jpms-spec-experts list it is suggested that "for sanity" Java 9
module names should not contain "any character whose Unicode code point
is less than 0x20". Yet the DEL (0x7F) character is allowed?
Taking a step backwards, it would appear that it was never considered
that Java 9 module names might need to be specified as identifiers in
existing bytecode processing scripting languages. That is 100%
understandable. However, now that it is known that that is the case,
doesn't it make sense "for sanity" to not allow a Java 9 module name to
contain spaces, single quotes, double quotes, semicolons or asterisks?
From the perspective of container code, those "sanity" characters are
selected completely arbitrarily. They have various usages in various
contexts, but any given module layer implementation doesn't necessarily
align with any of those contexts. For example there may be different
characters which don't make sense, while some of the examples you listed
do. So it makes the most sense to allow anything at the bytecode level,
and rely on the module layer implementation to apply the appropriate policy.
Post by Ess Kay
The problem with spaces, single quotes, double quotes in an identifier
that needs to be parsed from a text file is obvious.
Sure, but the lenient rules only apply to class files that were manually
generated. There are zero cases where one would have to parse an
arbitrary module identifier from a text file; every module layer
provider is going to have their own syntax and naming policy.
Post by Ess Kay
As mentioned
earlier, the problem with semicolons is that they are commonly used as
terminators in the scripting languages which nearly always use a Java
style syntax. The problem with asterisks is that they are commonly used
as a wildcard character in identifiers in the languages.
In the language, the module identifier is bounded by quite strict
syntax. Modules which are distributed for downstream consumption will
likewise adhere to these criteria (they have no way to avoid it short of
weird bytecode hacking). The only time the more general rules come into
play is when modules are being generated at run time from other module
systems and setups.

Let me put it another way.

Every module system has its own rules and restrictions for how a module
can be named. Those restrictions do not all exactly align. If you ban
the union of all the disallowed characters in all module systems, then
all module systems will break. However if you only ban the intersection
of such systems (i.e. control characters), and allow each layer to
enforce its own policy, then every system will work and interoperate as
expected. There is no downside because anyone who "cleverly" hacks
bytecode to produce a vanilla module with an invalid name will soon
realize that their module can never be found. A user has to go to
extraordinary measures to do so, so there is very little risk of such a
thing happening nor is there a risk of it impacting users in any
relevant way.

Everyone has their own notion of what "offensive" characters would be.
But enforcing these rules is and can only be the job of the layer provider.
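
The union-versus-intersection argument can be sketched with plain sets. The two "module systems" and their banned characters below are made up for illustration; only the set operations matter.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical policies; the banned characters are invented for illustration.
final class BannedCharPolicies {

    /** Characters banned by at least one module system. */
    static Set<Character> union(List<Set<Character>> policies) {
        Set<Character> result = new HashSet<>();
        for (Set<Character> policy : policies) {
            result.addAll(policy);
        }
        return result;
    }

    /** Characters banned by every module system. */
    static Set<Character> intersection(List<Set<Character>> policies) {
        if (policies.isEmpty()) {
            return new HashSet<>();
        }
        Set<Character> result = new HashSet<>(policies.get(0));
        for (Set<Character> policy : policies) {
            result.retainAll(policy);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Character> scriptDriven = Set.of('\t', ' ', ';');   // assumed policy A
        Set<Character> fileNameBased = Set.of('\t', '/', ':');  // assumed policy B
        List<Set<Character>> all = List.of(scriptDriven, fileNameBased);

        // Banning the union rejects names that are legal in one system or the other.
        System.out.println(union(all).size());
        // Banning only the intersection leaves each layer free to add its own rules.
        System.out.println(intersection(all).size());
    }
}
```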
--
- DML
Ess Kay
2017-01-05 08:22:27 UTC
Post by David M. Lloyd
No, that's not how it works at all ... Only container code will be
creating or referencing such modules
This may be true for JBoss & JEE modules. However, the very reason I
raised the initial question was because it is NOT true for identifiers
embedded in Java bytecode. There are utilities out there now that
manipulate bytecode that are driven by script files that specify Java
identifiers using a Java-style syntax. I raised the initial question
because I have the job of updating such a utility to support Java 9. In
part that entails parsing user specified module names out of a script file.
If we now need to support quotes and spaces etc in a module identifier then
it is going to be a problem.
Post by David M. Lloyd
In the language, the module identifier is bounded by quite strict syntax
If you are a downstream user processing bytecode files rather than source
code then that doesn't help. You technically need to support the broad
character range specified by the JVM specification even though you know
that the chance that someone will put a space or quote etc in a module name
is vanishingly small.
Post by David M. Lloyd
There are zero cases where one would have to parse an arbitrary module
identifier from a text file
As set out above, that is not the case. I'll give you another, perhaps more
mainstream, example. I imagine that there will be times when errors
reporting module issues (e.g. missing modules?) will be written to a log
file. It is not uncommon for other applications to read such log files and
process the contents. Technically the applications reading such log files
should, in their parsing, allow for module names containing spaces and
quotes etc because that is what the JVM specification says is possible.
Post by David M. Lloyd
Yes it's allowed (not 0x00 but the others are), but I don't think there
is any practical way to actually accomplish injecting most of those values
Again, as set out above, it doesn't matter if anyone actually puts spaces
and quotes etc in a module name. If you are going to be guided by the JVM
Spec then sadly you need to allow for them.
Post by David M. Lloyd
"sanity" characters are selected completely arbitrarily
Exactly. That is what makes it so frustrating. I have little doubt that if
the people who decided the range of excluded characters had to write the
code to parse out module names from text files then we would have a few
extra characters excluded and no existing module system would be
compromised.
Alan Bateman
2017-01-05 11:59:55 UTC
Post by David M. Lloyd
No, that's not how it works at all ... Only container code will be
creating or referencing such modules
This may be true for JBoss & JEE modules.
I see "Java EE modules" have been mentioned a few times in this thread.
Aside from the word "module", I'm not sure that it is relevant to
the discussion because the Java EE notion of modules is about naming or
locating a bundle of EE components. We of course hope that a future
version of Java EE will support modules but I think it's too early to
know if this will involve funky module names or not.

In any case, my understanding of the thread so far is that the
configuration scripts for this tool don't currently allow class or other
names that have been legal in class files for a long time (12 years?).
If users aren't complaining then maybe keep the status quo with module
names so that the configuration scripts only support module names that
are allowed in a module-info source file (Java identifiers separated by
'.'). The chances of meeting a module-info.class with funky module names
are low, at least for now. I realize that might not be completely
satisfactory but I assume you will need to provide a way to encode the
names of funky classes anyway some day; the problem is not specific to
module names.
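
A minimal sketch of the "Java identifiers separated by '.'" rule, as a tool might apply it to names read from a configuration script. The class name is hypothetical, and the check is deliberately approximate: it does not reject Java keywords, and it is stricter than the language in that it refuses identifier-ignorable control characters.

```java
// Hypothetical validator; it ignores Java keywords and restricted words.
final class SourceModuleName {

    /** Returns true if name is a '.'-separated sequence of Java-identifier-shaped parts. */
    static boolean isDottedIdentifiers(String name) {
        // The -1 limit keeps trailing empty parts, so "a." is rejected.
        for (String part : name.split("\\.", -1)) {
            if (part.isEmpty()) {
                return false;
            }
            for (int i = 0; i < part.length(); ) {
                int cp = part.codePointAt(i);
                boolean ok = (i == 0)
                        ? Character.isJavaIdentifierStart(cp)
                        // Assumption: also refuse ignorable control characters such as backspace.
                        : Character.isJavaIdentifierPart(cp) && !Character.isIdentifierIgnorable(cp);
                if (!ok) {
                    return false;
                }
                i += Character.charCount(cp);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isDottedIdentifiers("com.example.app")); // conventional name
        System.out.println(isDottedIdentifiers("com.example app")); // contains a space
        System.out.println(isDottedIdentifiers("\"quoted\""));      // contains double quotes
    }
}
```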

-Alan
Ess Kay
2017-01-06 04:27:24 UTC
Post by Alan Bateman
chances of meeting a module-info.class with funky module names is low
When I raised the initial question, I had no idea that the Java verifier
had been changed (with Java 6?) to allow "funky" package, class, field and
method names. Somehow that change passed right under the radar. Yes - a
possible option would be to simply ignore the broad character range allowed
by the JVM specification and trust that in practice no one would actually
use the unusual characters in package, class, field, method or module names.
A downside to that option is that we will no longer be able to say to our
users that we fully support the JVM specification which in some cases can
be a problem. Anyway, I guess it is time to accept the overwhelming inertia
of the status quo and move on to the next problem.
Peter Levart
2017-01-06 22:45:57 UTC
Hi Ess,
Post by Ess Kay
chances of meeting a module-info.class with funky module names is low
When I raised the initial question, I had no idea that the Java verifier
had been changed (with Java 6?) to allow "funky" package, class, field and
method names. Somehow that change passed right under the radar. Yes - a
possible option would be to simply ignore the broad character range allowed
by the JVM specification and trust that in practice no one would actually
use the unusual characters in package, class, field, method or module names.
A downside to that option is that we will no longer be able to say to our
users that we fully support the JVM specification which in some cases can
be a problem. Anyway, I guess it is time to accept the overwhelming inertia
of the status quo and move on to the next problem.
If I remember correctly, there was a crazy proposal in the past to
specify a syntax for arbitrary symbol names in Java. It went roughly like:

@"the syntax of Java string in here"


So you could write code like:


public class @"What a wonderful world!" {
public static void @"Let's party..."() {
}
}

//
@"What a wonderful world!".@"Let's party..."();


You could adopt this in your tool, what do you think?

Regards, Peter
Ess Kay
2017-01-07 00:40:49 UTC
As far as I can tell, the complete string @"What a wonderful world!" is
itself a valid module, package, class, field and method name. The '@'
character has a reserved status in a module name but the JVM spec says that
it may appear with some yet to be published meaning. Almost every possible
string of printable characters is a valid module, package, class, field and
method name. For example, the string \u0022\" is a valid 8 character Java
field or method name. The string \\u0022\\" is a valid 10 character Java
field or method name. So a solution that uses escape characters is not as
obvious as it may appear at first glance. You could even throw in a leading,
embedded and trailing space and it would still be valid.

I haven't yet tested this but, prima facie, even non-printable characters
such as backspaces and carriage returns are permitted in package, class,
field and method names (but not module names.) Does the JVM support some
escaping scheme to allow such characters in JAR manifests and service
provider specifications? If the answer is yes then what is it? If the
answer is no then doesn't that demonstrate the absurdity of the situation?

So at this point Alan's suggested initial 'do nothing' approach is
attractive. At this point the flexibility that the JVM spec gives is
totally gratuitous in that no one as yet appears to have had any reason to
make use of it.
Peter Levart
2017-01-07 06:44:04 UTC
Hi Ess,

I have been reminded that the syntax for "Exotic identifiers" in the Java
language, as proposed for JDK 7 but then withdrawn, used the '#' character
as a prefix in front of a classical string literal:

http://mail.openjdk.java.net/pipermail/coin-dev/2009-March/001131.html

I accidentally replaced it with a syntax for Obj-C NSString literals.
Post by Ess Kay
As far as I can tell, the complete string @"What a wonderful world!"
is itself a valid module, package, class, field and method name.
The '@' character has a reserved status in a module name but the JVM
spec says that it may appear with some yet to be published meaning.
Almost every possible string of printable characters is a valid
module, package, class, field and method name. For example, the
string \u0022\" is a valid 8 character Java field or method name.
Written with exotic identifier syntax as:

#"\\u0022\\\""
Post by Ess Kay
The string \\u0022\\" is a valid 10 character Java field or method
name. So a solution that uses escape characters is not as obvious as
it may appear at first glance. You could even throw in a leading,
embedded and trailing space and it would still be valid.
No problem. A sequence of any Unicode characters is expressible as a
string literal and consequently as an exotic identifier when prefixed
with #.
Post by Ess Kay
I haven't yet tested this but, prima facie, even non-printable
characters such as backspaces and carriage returns are permitted in
package, class, field and method names (but not module names.) Does
the JVM support some escaping scheme to allow such characters in JAR
manifests and service provider specifications? If the answer is yes
then what is it? If the answer is no then doesn't that demonstrate
the absurdity of the situation?
It appears that NUL, CR, and LF can't be part of header values in JAR
manifests, but other characters can:

http://docs.oracle.com/javase/7/docs/technotes/guides/jar/jar.html#Manifest_Specification

Notes on Manifest and Signature Files

Line length:
No line may be longer than 72 bytes (not characters), in its
UTF8-encoded form. If a value would make the initial line longer than
this, it should be continued on extra lines (each starting with a single
SPACE).

Limitations:
Because header names cannot be continued, the maximum length
of a header name is 70 bytes (there must be a colon and a SPACE after
the name).
NUL, CR, and LF can't be embedded in header values, and NUL,
CR, LF and ":" can't be embedded in header names.
Implementations should support 65535-byte (not character)
header values, and 65535 headers per file. They might run out of memory,
but there should not be hard-coded limits below these values.
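
As a small sketch of the header-value limitation quoted above, a tool could pre-check a name before writing it into a manifest. The class and method names are hypothetical; the check only mirrors the quoted sentence (no NUL, CR or LF) and says nothing about how the JDK's own manifest classes behave.

```java
// Hypothetical pre-check based only on the limitation quoted from the JAR specification.
final class ManifestValueCheck {

    /** Returns true if value contains none of NUL, CR or LF. */
    static boolean isEmbeddable(String value) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\0' || c == '\r' || c == '\n') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isEmbeddable("com.example.Main"));   // ordinary class name
        System.out.println(isEmbeddable("odd \"name\" \\ ok")); // spaces, quotes and backslash
        System.out.println(isEmbeddable("line\nbreak"));        // embedded LF
    }
}
```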

Regards, Peter
Ess Kay
2017-01-07 08:34:01 UTC
Post by Peter Levart
using the syntax for exotic identifiers
At the level of the JVM, the entire string #"@\"What a wonderful world!\""
is a valid 31 character package, class, method and field name including the
#, the \ and the quotes. So that syntax doesn't help.
Post by Peter Levart
appears that NUL, CR, and LF can't be part...
Prima facie, CR and LF characters are valid in Java package and class
names. To fully support the JVM Spec, the JVM Manifest processing classes
should provide an escaping scheme to allow such characters. However, it
cannot use the obvious '\' character because it too is allowed in Java
package and class names.
Nicolai Parlog
2017-01-07 15:51:31 UTC
Hi Ess!
Post by Ess Kay
At the level of the JVM, the entire string #"@\"What a wonderful
world!\"" is a valid 31 character package, class, method and field
name including the #, the \ and the quotes.
Yes but in your scripts it's not, right? I thought your problem was
that users needed a way to express "crazy identifiers" in _your_ (or
other Java-like) script languages. Why wouldn't Peter's syntax
proposal suffice to allow that?

so long ... Nicolai
Ess Kay
2017-01-08 01:33:42 UTC
Post by Nicolai Parlog
Why wouldn't Peter's syntax proposal suffice
For any syntax to work (in ANY context) you need to be able to distinguish
between the use of that syntax and the specification of an identifier (e.g.
a module, package, class, field or method name) which happens to match that
syntax. Let's take Peter's string #"\\u0022\\\"" as an example. How is
that string to be interpreted? Is it an example of the proposed syntax that
should be interpreted as the 11 character identifier \\u0022\\\" or is it a
14 character identifier starting with # and ending with two double quotes?
Module, package, class, field and method names can legally start with the 2
characters #" and end in a double quote. That is why Peter's syntax would
not work. This is the difficulty when you specify that almost any
character and character combination is valid in an identifier.
Post by Nicolai Parlog
I thought your problem was that users needed a way to express
"crazy identifiers" in _your_ (or other Java-like) script languages.
My discussion of the problem of crazy identifiers in JAR manifests was
more of a "parting shot". It is very easy to be ultra flexible in a
specification. It is very concise and even aesthetically pleasing.
However, it can be much, much harder to actually support that ultra
flexibility in practice. Problems can occur in unexpected places. Such
is life.
Peter Levart
2017-01-08 09:18:29 UTC
Hi Ess,
Post by Ess Kay
Why wouldn't Peter's syntax proposal suffice
For any syntax to work (in ANY context) you need to be able to distinguish
between the use of that syntax and the specification of an identifier (e.g.
a module, package, class, field or method name) which happens to match that
syntax. Let's take Peter's string #"\\u0022\\\"" as an example. How is
that string to be interpreted? Is it an example of the proposed syntax that
should be interpreted as the 11 character identifier \\u0022\\\" or is it a
14 character identifier starting with # and ending with two double quotes?
If this sequence of characters appears in source at a position where
an identifier is expected:

#"\\u0022\\\""

then it is interpreted as an identifier consisting of the following characters:

\u0022\"

This is unambiguous because the syntax of a "plain" identifier
(as opposed to an "exotic" identifier) doesn't allow it to start with
the character #.

If the parser encounters the character # followed by a double quote, it knows
it is the start of an exotic identifier.
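
The rule just described can be sketched as a small token decoder for a script language. Everything here is an assumption made for illustration: the class name is hypothetical, only the escapes \\ and \" are supported, and a real string-literal syntax would have more.

```java
// Hypothetical reader for the #"..." form discussed in this thread; not a JDK API.
final class ExoticIdentifierReader {

    /** Decodes a token that is either a plain identifier or an exotic identifier of the form #"...". */
    static String decode(String token) {
        if (!token.startsWith("#\"")) {
            return token; // plain identifiers stand for themselves
        }
        if (token.length() < 3 || !token.endsWith("\"")) {
            throw new IllegalArgumentException("unterminated exotic identifier: " + token);
        }
        String body = token.substring(2, token.length() - 1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '"') {
                throw new IllegalArgumentException("unescaped quote in: " + token);
            }
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (++i >= body.length()) {
                throw new IllegalArgumentException("dangling backslash in: " + token);
            }
            char escaped = body.charAt(i);
            if (escaped != '\\' && escaped != '"') {
                throw new IllegalArgumentException("unsupported escape in: " + token);
            }
            out.append(escaped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The token from the message above, written as a Java string literal.
        String token = "#\"\\\\u0022\\\\\\\"\"";
        System.out.println(token + " -> " + decode(token));
        System.out.println(decode("java.base"));
    }
}
```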
Post by Ess Kay
Module, package, class, field and method names can legally start with the 2
characters #" and end in a double quote. That is why Peter's syntax would
not work.
Why? If the name of an identifier starts with #"... then such an identifier
can only be expressed in the syntax of exotic identifiers, therefore you
have to write it in the source as:

#"#\"..."
Post by Ess Kay
This is the difficulty when you specify that almost any
character and character combination is valid in an identifier.
I see no difficulty here.

Regards, Peter
Ess Kay
2017-01-09 00:55:34 UTC
Permalink
Post by Peter Levart
If this sequence of characters appear in source at position where
#"\\u0022\\\""
\u0022\"
Then what happens when a user wants to specify the valid 14 character class
name #"\\u0022\\\"" ? Perhaps I am misunderstanding you? Do you accept
that according to the JVM specification a module, package, class, field or
method name in a Java class file can legally start with the two characters
#" and end with a single double quote? For an escape character sequence to
work it is essential that it is not otherwise legal in a particular
string. That is not the case with the #"..." sequence. I don't think there
is any character that is invalid across module, package, class, field and
method names. So there is no one character or character sequence that can
act as an escape sequence across all names.
Peter Levart
2017-01-09 08:48:46 UTC
Permalink
Hi Ess,
Post by Ess Kay
Post by Peter Levart
If this sequence of characters appear in source at position where
#"\\u0022\\\""
\u0022\"
Then what happens when a user wants to specify the valid 14 character
class name #"\\u0022\\\"" ?
He would write it in source like:

#"#\"\\\\u0022\\\\\\\"\""

...this is hard to read, but doable.
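
The reverse direction, turning an arbitrary raw name into its exotic form, can be sketched the same way. The names below are hypothetical, and the escape set is again an assumption limited to backslash, double quote, \n, \r and \t.

```java
public final class ExoticIdentifierEncoder {

    /**
     * Wraps a raw name in #"..." and escapes the characters that would
     * otherwise end or corrupt the quoted body. The escape set is an
     * assumption made for this sketch.
     */
    public static String encode(String rawName) {
        StringBuilder out = new StringBuilder("#\"");
        for (int i = 0; i < rawName.length(); i++) {
            char c = rawName.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '"':  out.append("\\\""); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:   out.append(c);
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        // Java string literal for the 14-character raw name under discussion.
        String raw = "#\"\\\\u0022\\\\\\\"\"";
        String written = encode(raw);
        System.out.println(written + " (" + written.length() + " characters)");
    }
}
```

Applied to the 14-character name, this yields the 25-character form written above: every backslash and double quote in the raw name doubles in length, and the #"..." wrapper adds three more characters.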
Post by Ess Kay
Perhaps I am misunderstanding you? Do you accept that according to
the JVM specification a module, package, class, field or method name
in a Java class file can legally start with the two characters #" and
end with a single double quote?
By JVM specification, yes.
Post by Ess Kay
For an escape character sequence to work it is essential that it is
not otherwise legal in a particular string. That is not the case with
the #"..." sequence.
We are talking about identifiers, remember?

Normally in Java, an identifier can contain
(https://en.wikipedia.org/wiki/Java_syntax#Identifier):

Any Unicode character that is a letter (including numeric letters
like Roman numerals) or digit.
Currency sign (such as $).
Connecting punctuation character (such as _).

An identifier cannot:

Start with a digit.
Be equal to a reserved keyword, null literal or boolean literal.


Therefore, you cannot start a Java (the language) identifier with the
character #. The "exotic" identifiers syntax was devised to enable Java
(the language) to express any identifier that is otherwise possible by
JVM specification, mainly to enable inter-operation between Java and
other JVM based languages that might use different rules as far as
identifiers are concerned. Because the proposal was withdrawn, the status
quo now is that if you want to be inter-operable with Java, you have to
play by Java rules, at least in the parts where Java and any other
language inter-operate.
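
A minimal Java sketch of the "plain" identifier rule, using the standard Character.isJavaIdentifierStart and Character.isJavaIdentifierPart methods. The class and method names are hypothetical, and the check for reserved keywords and the null and boolean literals is deliberately left out.

```java
public final class PlainIdentifierCheck {

    /**
     * Returns true if the text has the shape of a plain Java identifier:
     * a valid start character followed only by valid part characters.
     * Keywords and the literals null, true and false are NOT excluded here.
     */
    public static boolean isPlainIdentifier(String text) {
        if (text.isEmpty()) {
            return false;
        }
        int first = text.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Walk by code point so supplementary characters are handled correctly.
        for (int i = Character.charCount(first); i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPlainIdentifier("foo_1$"));   // shape of a plain identifier
        System.out.println(isPlainIdentifier("#\"x\""));   // starts with #, so not plain
    }
}
```

Because # is never accepted as a start character, a parser can treat a leading #" as the unambiguous start of the quoted form.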
Post by Ess Kay
I don't think there is any character that is invalid across module,
package, class, field and method names. So there is no one character
or character sequence that can act as an escape sequence across all
names.
I still don't see your problem. I showed that any sequence of characters
is expressible using exotic identifiers syntax, because it borrows from
the syntax of Java string literals. Even CR and LF are expressible with
\r and \n . I showed that exotic identifiers syntax is not ambiguous,
because normally, using plain identifiers syntax, identifiers can not
contain character # .


Regards, Peter
Ess Kay
2017-01-10 07:28:05 UTC
Permalink
Post by Peter Levart
#"#\"\\\\u0022\\\\\\\"\""
Of course the above is itself a valid 25 character module, package, class,
field or method name. It doesn't matter how deep you nest.
Peter Levart
2017-01-10 16:37:05 UTC
Permalink
Post by Ess Kay
Post by Peter Levart
#"#\"\\\\u0022\\\\\\\"\""
Of course the above is itself a valid 25 character module, package,
class, field or method name. It doesn't matter how deep you nest.
It is, but as it is written (starting with characters #" ) it can only
be parsed as an exotic identifier and not as 25 character identifier. No
ambiguity here.
Ess Kay
2017-01-11 08:28:47 UTC
Permalink
Post by Peter Levart
but as it is written (starting with characters #" ) it
can only be parsed as an exotic identifier
Peter, you have previously said "syntax of 'plain' identifier (as opposed
to 'exotic' identifier) doesn't allow it to start with character #". That
really confused me because in Java bytecode module, package, class, field or
method names can indeed start with the characters #". Indeed
the original proposal at
http://mail.openjdk.java.net/pipermail/coin-dev/2009-March/001131.html does
rely on the fact that the #"..." syntax would not otherwise be allowable in
Java source code.

However, I can now see that your proposed syntax could work if the script
parser ALWAYS interpreted JUST the OUTER #"..." as a case of your special
syntax. So, if you wanted to specify in a script the field name #"x" then
you would have to write it as #"#\"x\"". Is that what you were meaning all
along? That or something along the same lines could work perhaps without
too much disruption to existing parsing code. I will need to explore it
further but thank you.

Michael Rasmussen
2017-01-05 12:47:12 UTC
Permalink
Post by Ess Kay
There are utilities out there now that
manipulate bytecode that are driven by script files that specify Java
identifiers using a Java-style syntax. I raised the initial question
because I have the job of updating such a utility to support Java 9. In
part that entails parsing user specified module names out of a script file.
If we now need to support quotes and spaces etc in a module identifier then
it is going to be a problem.
As already mentioned, this should only be a problem for generated modules,
meaning when not compiled from a module-info.java file (where the language
level rules apply).
Also, assuming your bytecode manipulator touches classes and members, it
shouldn't be any different from what you currently have in order to support
generated classes, where the characters you mention are perfectly legal for
class and member names.
Example of such a generated class with spaces and quotes:
https://gist.github.com/anonymous/e1b9971d3079575066dcf060327bb323

/Michael
Ess Kay
2017-01-02 08:38:35 UTC
Permalink
My questions relate to module names in the binary form in the
module-info.class file.

My scenario is a utility that reads Java bytecode files and performs
transformations on them as specified by the user in a text script file. In
the text script file the user currently specifies package, class, field and
method names (amongst other things). If the utility and others like it is
to support Java 9 then the user must be able to specify module names that
match those within module-info.class files. These names must then be
parsed from within the script. This is where the difficulties arise.

Typically, utilities which process bytecode use a Java-like syntax in their
scripting languages for obvious reasons. So if a module name can start with
/* or // then existing comment parsing will be needlessly disrupted. If a
module name can start with or contain double or single quotes then existing
String literal parsing will be needlessly disrupted. If a module name can
contain semi-colons then existing line termination parsing will be disrupted.
Etc. etc.

Put another way, is there an existing module system in the world today that
commonly uses spaces, single quotes, double quotes or semicolons in a
module identifier? If not then why should these characters be allowed
within a Java 9 module name? Providing flexibility for some future
hypothetical module system has its costs here and now for downstream users.
Alan Bateman
2017-01-02 08:55:32 UTC
Permalink
Post by Ess Kay
My questions relate to module names in the binary form in the
module-info.class file.
My scenario is a utility that reads Java bytecode files and performs
transformations on them as specified by the user in a text script
file. In the text script file the user currently specifies package,
class, field and method names (amongst other things). If the utility
and others like it are to support Java 9 then the user must be able to
specify module names that match those within module-info.class files.
These names must then be parsed from within the script. This is where
the difficulties arise.
How are class and other names encoded in the "text script file" today?
Just asking because they may contain characters from the entire Unicode
code space.

-Alan
Ess Kay
2017-01-02 23:28:55 UTC
Permalink
Post by Alan Bateman
they may contain characters from the entire Unicode code space.
The Java 8 JVM spec says that class names can contain characters from the
entire Unicode code space "where not further constrained". The only
explicit restriction that is specified is that an unqualified name cannot
contain '.', ';', '[' or '/'.
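
As a rough Java illustration of that explicit restriction (hypothetical names; it covers only the four characters listed above plus the assumption that a name must be non-empty, and it ignores any further constraints that apply to particular kinds of name):

```java
public final class UnqualifiedNameCheck {

    // The four characters named above as forbidden in an unqualified name.
    private static final String FORBIDDEN = ".;[/";

    /**
     * Returns true if the name is non-empty and contains none of the
     * forbidden characters. Everything else, including spaces, quotes
     * and asterisks, is accepted by this check.
     */
    public static boolean isLegalUnqualifiedName(String name) {
        if (name.isEmpty()) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            if (FORBIDDEN.indexOf(name.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isLegalUnqualifiedName("my \"odd\" name*")); // accepted by this check
        System.out.println(isLegalUnqualifiedName("a.b"));              // rejected: contains '.'
    }
}
```

The check shows how little the written rule excludes: a name containing spaces, quotes or asterisks passes it.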

However, historically, the JVM has been much more restrictive. If you put
a single quote or double quote or asterisk in a package or class name then
the 1.4.2 JVM would report a "ClassFormatError: Illegal class name". Even
the Java 9 JVM will currently report this error if you are executing a
class compiled for Java 1.4.2. Earlier JVMs would not even allow fully
numeric package or class names.
Post by Alan Bateman
How are class and other names encoded in the "text script file" today?
Given the restrictiveness of previous JVMs, package and class names have
been assumed to be in a format that would be accepted by a Java compiler.
That is, no spaces, single quotes, double quotes or asterisks allowed.

In response to your questions, I have performed some tests and see that the
newer JVMs will allow spaces, single quotes, double quotes and asterisks
in package, class, field and method names provided the class file is in a
newer format. So the nightmare that is being proposed for module names
technically already exists for package, class, field and method names in
more recent class file versions.

Given the above, could I respectfully broaden my initial question to ask
whether the relatively recent decision to allow spaces, single quotes,
double quotes and asterisks in package, class, field and method names could
be reconsidered? I would suggest that it is most unlikely that anyone
would need to use such characters in that context. The alternative of
dealing with such characters in the scenario of scripting languages as I
have described would be quite ugly.