2015-08-27

Principia

hello_world ↦ “Hello, World!”

Futuristic programming, reasoning and proof language.

About the name

The name is a reference to Whitehead & Russel's Principia Mathemica. ‘Mathematica’ was already taken by another cool programming language, so I choose ‘Principia’. It fits, because this language is focused more on principles and less on mathematics. (Although I envision a fully-proven high performance computer algebra library to be one of the first applications).

Syntax

The syntax is minimalist and heavily uses the Unicode Standard. Please use something like compose keys to enter Unicode characters (here's my setup: XCompile).

The language is a union between a programming language and a proof language. The programming language consists of just two constructs, calls and closures. It has eager evaluation, first-class functions and multiple return values. The proof language has four constructs, because, axiom, therefore and proofs. Both these languages live in a space that is similar to a directed cyclic multi-graph.

Unifying these languages is a meta-language that turns these graphs into a linear syntax for consumption by text editors. The meta-language handles scoping and identifier binding. It should be noted that binding is separate from the programming language. The programming language has no concept of variables and binding (and therefore, none of the problems associated with it).

Identifiers

The syntax for identifiers conforms to Unicode 6.2.0 UAX31-C1 revision 17. This essentially means that anything that could be considered a sensible identifier in any language is recognized as such.

In addition to these identifiers there are symbol identifiers such as + or ≥. These are defined by the Unicode Pattern Syntax. The parser will consider them identifier on their own, so ab+cd is a sequence of three identifiers: ab, + and bc. Currently the parser treats + and other symbolic identifiers as any identifier.

TODO: have some sort of infix notation parsing.

Statements

The language consists of a list of statements, each followed by a newline (or more precisely, a paragraph separator). Statements are of the general form:

outbound₁ outbound₂ … statement_kind inbound₁ inbound₂ …

Where outbound are identifiers and inbound can be both identifiers, literals and expressions. The statement_kind is one of six specific symbols:

≔ call
↦ closure
∵ because
∴ therefore
⊨ axiom
⊢ proofs

These are discussed further down. Since these are keywords, they are excluded from the list of usable identifier symbols. They are the only keywords in the language.

String literals

String literals start with a double opening quote (“, U+201C) and are closed with a double closing quote (”, U+201D). Note that the dumb symmetrical typewriter quotes " you find on your keyboard have no role.

The parser keeps track of the number of open quotes within the literal and only stops the literal when the opening double quote is closed. This has the advantage that to quote a piece of code, you simply enclose it in double quotes:

hello_world_source ↦ “
	hello_world ↦ “Hello, World!”
”

There are no escape sequences or ‘here-doc’ syntax. Any Unicode text that has balanced double quotes can be turned into a valid string literal by quoting it. This is basically any sensible string. Strings with unbalanced double quotes can not be literals. They can be create it programmatically.

Numeric constants

Numeric constants are written with mantissa, optional base and optional signed exponent. If no base is specified it is assumed decimal.

57 885 161
5F3759DF₁₆
10 0110 0110 0101₂
1₂⁶⁴
6.626 069 57₁₀⁻³⁴

The mantissa digits are taken from 0—9 and A—Z (uppercase, because Unicode does not recognize lowercase digits). This means the maximum base is 36. The radix point is given by a full stop dot (., U+002E). Thin spaces ( , U+2009) can be used to group digits in the mantissa.

The base and exponent are written in decimal using ₀—₉ and ⁰—⁹ respectively. The exponent can have optional signs ⁺ or ⁻. Unicode codepoints U+2080—U+2089, U+2070—U+2079 and U+207A, U+207B respectively.

Scoped statements (advanced)

Statements can be scoped by

scoping_statement₁
	statement₁
	statement₂
	scoping_statement₂
		statement₃
		statement₄
			⋱
	statement₅
	⋮

TODO: https://en.wikipedia.org/wiki/Scope_resolution_operator

… (out₁ · out₃  … statement_kind (a ↦ b) in₂ …) (≔ f out₃) out₁ …

… dummy d2 out₁ …
	out₁ dummy out₃  … statement_kind tmp in₂ …
		tmp a ↦ b
	d2 ≔ f out₃

func out₁ out₂ ↦ in₁ in₂
	<body>

→func
func out₁ out₂ ↦ in₁ in₂
	→out₁
	→out₂
	<body>
	←in₁
	←in₂



out₁ out₂ ≔ func in₁ in₂
	<body>

out₁ out₂ ≔ func in₁ in₂
	<body>
	←func
	←in₁
	←in₂
→out₁
→out₂

If statement_kind₁ is a ↦, out₁ is visible to the root scope, all are visible in the nested scope. in₁ and in₂ are resolved at the end of the nested scope.
If statement_kind₁ is a ≔, out₁ is visible to the root scope

Resolution:

Look up in the scope
Look down in the scope
Go up one scope
Repeat

Inline statements (advanced)

A statement enclosed in parentheses can be used in place of an inbound. The syntax is:

… (out₁ · out₃  … statement_kind in₁ in₂ …) …

Where the middle dot (·, U+00B7) is used as a placeholder to specify which inbound or outbound represents the expression in the parent expression.

If there is no middle dot, then it is implicitly prepended, so

… (out₂ … statement_kind in₁ in₂ …) …

is equivalent to

… (· out₂ … statement_kind in₁ in₂ …) …

For the purposes of binding identifiers, the inline statement is considered to be in the scope of the parent statement.

Binding rules (advanced)

To be written

Generally tries to be intuitive, with both forward and backward searching.

Syntax extensions (proposal)

Similar to Coq's Notation mechanism

[ (x) , … , (y) ]

is parsed as:

s₁ s₂ x s₃ s₄ … s₄ s₂ y s₃ s₅

with

s₁ → [
s₂ → (
s₃ → )
s₄ → ,
s₅ → ]

Expressions of length 0 … n are parsed as:

s₁ s₅
s₁ s₂ x₁ s₃ s₅
s₁ s₂ x₁ s₃ s₄ s₂ x₂ s₃ s₅
s₁ s₂ x₁ s₃ s₄ s₂ x₂ s₃ s₄ s₂ x₃ s₃ s₅
⋮

A substitution can be defined as

(≔ cons x … (≔ cons y nil) … )

Todo: Remove ( · ≔ ) and ( · ↦ ) notation from the formal language and reimplemented it as a notation extension.

( o₂ … oᵢ ≔ f i₁ … iᵢ )

which is to be replaced by t and the following statement added:

t o₂ … oᵢ ≔ f i₁ … iᵢ

The case for ↦ is equivalent. If we can capture ≔ and ↦ in a variable we can even do it like this:

( x₂ … xᵢ )

to be replaced by t with the following statement added:

t x₂ … xᵢ

And similarly for the variation with ·:

( x₁ … xᵢ · y₁ … y₂ )

x₁ … xᵢ t y₁ … y₂

This version neatly captures the variation where an incoming variable is captured.

While this is already quite powerful, likely enough to define a . Al these examples replace a pattern by a single temporary variable. Is there a use-case for patterns that get replaced

Ideas

Things to express:

https://en.wikipedia.org/wiki/Yoneda_lemma#The_Yoneda_embedding
https://en.wikipedia.org/wiki/Continuation
https://en.wikipedia.org/wiki/Delimited_continuation
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.41.125&rep=rep1&type=pdf

Programming language

Call statement

return₁ return₂ … ≔ function argument₁ argument₂ …

The call statement calls the function in inbound position one with the remaining inbounds as arguments. The return values are assigned to the outbounds.

If the arity of the closure mismatches the call, the arguments and retun values are padded with (↦).

Closure statement

function argument₁ argument₂ … ↦ return₁ return₂ …

The closure statement is perfectly symmetrical to the call statement. Now, the function and it's arguments are outbound and the return values are inbound.

On eagerness

The language is eager

Object oriented programming and other conventions

Scott encoding

Proof language

To be written

Because (pre-conditions)

∵ proposition

Therefore (post-conditions)

∴ proposition

Axiom

⊨ proposition

Proofs (assertions)

⊢ proposition

Implementation

C++11 (Clang)
Queχ (only if you regenerate the parser)
LLVM (MCJIT)

The Queχ based parser produces an abstract syntax tree (AST). The binder then binds all the identifiers in the AST. From this the program data flow graph (DFG) is created. The DFG is the natural representation of the program and is where static verification and optimizations like constant propagation happen.

From this DFG the (nested) closures are extracted and lambda lifted. The lifted closures are then serialized and converted to a simple stack machine language. The stack machine language is compiled to LLVM intermediate representation (IR). The IR run through optimization passes and JIT compiled to native machine code.

How to build and run

git clone git@github.com:Recmo/Principia.git
cd Principia
make -j4
./Principia Ackermann.txt PRA 3 3

Appendix: Recommended compose bindings

For programming in this Unicode heavy language I use XCompose. My compose key is caps lock which is otherwise disabled.

Symbol Key sequence

↦ caps lock | → ≔ caps lock : = · caps lock c .

Appendix: Unicode symmetric pairs

https://en.wikipedia.org/wiki/Bracket

( ) [ ] { } < > ‘ ’ “ ” ⸤ ⸥ ｢｣ ⟦ ⟧ ⟨ ⟩ 【】 ⟦ ⟧ ⟨ ⟩ ⟪ ⟫ ⟬ ⟭ ⟮ ⟯ ⦃ ⦄ ⦅ ⦆ ⦇ ⦈ ⦉ ⦊ ⦋ ⦌ ⦍ ⦎ ⦏ ⦐ ⦑ ⦒ ⦓ ⦔ ⦕ ⦖ ⦗ ⦘ ⌈ ⌉ ⌊ ⌋ ⌌ ⌍ ⌎ ⌏ ⌜ ⌝ ⌞ ⌟

Experiment: strict CPS

≔ function argument₁ argument₂ …

function argument₁ argument₂ … ↦

pyth x y ↦ r
	x2 ≔ mul x x
	y2 ≔ mul y y
	r2 ≔ add x2 y2
	r ≔ sqrt r2

pyth x y r ↦
	≔ mul x x (x2 ↦)
	≔ mul y y (y2 ↦)
	≔ add x2 y2 (r2 ↦)
	≔ sqrt r2 r